Abstract


The  interpretation  of  visual  motion  is  investigated.  The  task  of  motion 
perception  is  divided  into  two  major  subtasks:  (i)  estimation  of  two 
dimensional  retinal  motion,  and  (ii)  computation  of  parameters  of  rigid 
motion  from  retinal  motion.  Retinal  motion  estimation  is  performed  using 
a  point  matching  algorithm  based  on  local  similarity  of  matches  and  a  global 
clustering  strategy.  The  clustering  technique  unifies  the  notion  of  matching 
and  motion  segmentation  and  provides  an  insight  into  the  complexity  of  the 
matching  and  segmentation  process.  The  constraints  governing  the 
computation  of  the  rigid  motion  parameters  from  retinal  motion  are 
investigated. The emphasis is on determining the possible ambiguities of
interpretation and how to remove them. This theoretical analysis forms the
basis  of  a  set  of  algorithms  for  computing  structure  and  three  dimensional 
motion  parameters  from  retinal  displacements.  The  algorithms  are 
experimentally  evaluated.  The  main  difficulties  facing  the  computation  are 
seen  to  be  nonlinearity  and  a  high  dimensional  search  space  of  solutions.  To 
alleviate these difficulties, an active tracking method is proposed. This is a
closed loop system for evaluating the motion parameters. It is shown that
under such a regime it is possible to obtain closed form solutions for the
motion parameters. This leads to a robust cooperative algorithm for motion
perception requiring a minimal amount of retinal motion matching. The
central  theme  for  this  research  has  been  the  evaluation  of  a  hierarchical 
model  for  visual  motion  perception.  To  this  end,  the  investigations 
revolved  around  three  primary  issues:  (a)  retinal  motion  computation  from 
intensity  images;  (b)  the  conditions  under  which  three  dimensional  motion 
may  be  computed  from  retinal  motion,  and  the  efficacy  of  algorithms  that 
perform  such  computation;  (c)  the  active  vision  or  closed  loop  approach  to 
visual  motion  interpretation  and  what  it  buys  us.  This  thesis  records 
fundamental contributions pertaining to the above questions.

Chapter One


Introduction 


1.1.  The  Motion  Perception  Problem 

Our visual perception creates awareness of the world around us, which
consists of a rich variety of objects. These objects are characterized by
different  shapes,  colors  and  motion  patterns.  The  visual  data  (or  stimulus) 
that  is  captured  by  the  eyes  is  in  essence  two  dimensional  patterns  of  light 
reflected  from  the  surfaces,  normally  of  solid  rigid  objects,  that  exist  in  our 
environment.  When  we  ponder  the  complexity  of  the  three  dimensional 
scene  surrounding  us,  it  is  apparent  that  despite  the  unconscious  ease  with 
which our brain interprets the visual data available to it, visual perception is
a  complicated  task.  Two  of  the  problems  associated  with  actually  “seeing” 
in  three  dimensions  are  immediately  apparent.  First,  the  images  formed  on 
the  retina  of  the  eyes  are  two  dimensional,  thus  three  dimensional 
information  is  only  implicit.  Second,  the  retinal  images  are  continually 
changing, due to the movement of the objects we are seeing or due to our own
movement.

To relate retinal images to object models, natural constraints in the world
must be used. The retinal stimulus contains implicit information that can be
used  to  recover  aspects  of  the  three  dimensional  world  being  viewed.  There 
are  two  images  formed  on  the  retinas  of  the  two  eyes,  which  are  spatially 
displaced  from  each  other.  The  principle  of  stereoscopic  fusion  (or 
triangulation)  of  the  images  of  a  point  in  space  to  compute  its  depth  has 
been  recognized  for  a  long  time.  There  is  also  the  intimate  relationship 
between  the  local  surface  shape  (e.g.  slant  and  tilt)  and  motion  induced 
change  in  the  retinal  intensity  pattern. 

This  thesis  is  concerned  with  one  particular  set  of  visual  constraints, 
namely  those  having  to  do  with  motion.  The  problem  under  study  concerns 
the  task  of  computation  of  the  three  dimensional  motion  between  an 
observer  and  rigid  objects. 

It is now widely accepted that motion is a fundamental sense or
modality that is extracted from the visual stimulus array (see [63]). This
computational  study  of  motion  will  be  restricted  to  stimuli  that  contain 
information  primarily  about  spatio-temporal  variations  in  the  image  intensity 
distribution.  However,  it  is  useful  to  bear  in  mind  that  submodalities  like 
depth  and  surface  orientation  can  prove  helpful  in  analyzing  the  motion 
understanding process. On the other hand, color information is assumed to
play a minimal role in the perception of motion. Therefore, the visual input
that is considered useful consists of monochromatic images from either one or both
eyes.

The  computational  goal  of  the  motion  perception  process  is  to  obtain 
estimates  for  the  three  dimensional  velocities  of  the  objects  being  observed. 
The  latter  quantities  are  also  termed  relative  motion  parameters  and  are 
global  attributes  of  the  moving  bodies.  In  a  theoretical  sense  there  is  not 
much  difference  in  analyzing  a  scene  containing  a  single  moving  object  and 
another  containing  multiple  objects  in  motion.  In  the  latter  case,  the  motion 
analysis  must  first  perform  segmentation  or  break  up  the  two  dimensional 
image  into  the  various  regions  corresponding  to  the  different  object  surfaces. 
Subsequent  to  this,  the  individual  segments  can  be  treated  separately  as 
image  fragments  dealing  with  single  body  motion. 

In  subsequent  portions  of  this  document,  unless  otherwise  specified, 
the  treatment  of  three  dimensional  motion  interpretation  deals  with 
egomotion.  This  is  the  situation  where  the  motion  stimuli  are  generated  due 
to  the  movement  of  the  observing  system  in  a  static  visual  environment. 
The  reasons  for  this  simplification  are 

(i)  The  two  dimensional  motion  estimation  algorithm  that  is  proposed  in 
chapter  two  can  handle  motion  segmentation  and  hence  subsequent 
analysis  need  not  deal  with  more  than  one  moving  surface. 


(ii)  Mathematically,  there  is  no  difference  between  the  motion  stimuli  due 
to  a  static  observer,  whose  entire  visual  field  registers  motion  due  to 
one moving object, and that for a moving observer registering the
relative motion of the static surround.
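
Reason (ii) can be checked numerically. The following is a minimal sketch, assuming a pinhole projection with unit focal length and, for brevity, a rotation about the vertical axis only; the function names and numerical values are illustrative and are not taken from the thesis. A point on an object that undergoes a rigid motion (R, t) in front of a static camera projects to the same image location as the static point seen by a camera that has undergone the inverse motion.

```python
import math

def rot_y(angle):
    """Rotation matrix about the vertical (y) axis, as nested lists."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def mat_vec(m, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def transpose(m):
    return [[m[j][i] for j in range(3)] for i in range(3)]

def project(p, focal=1.0):
    """Pinhole projection of a point given in camera coordinates."""
    x, y, z = p
    return (focal * x / z, focal * y / z)

def view_moving_object(point, R, t):
    """Static camera at the origin; the object point undergoes X' = R X + t."""
    moved = [a + b for a, b in zip(mat_vec(R, point), t)]
    return project(moved)

def view_moving_camera(point, R, t):
    """Static point; the camera undergoes the inverse motion.

    The camera orientation becomes R^T and its centre moves to c = -R^T t,
    so the point in camera coordinates is R (X - c) = R X + t.
    """
    centre = [-a for a in mat_vec(transpose(R), t)]
    relative = [a - b for a, b in zip(point, centre)]
    return project(mat_vec(R, relative))

# Hypothetical motion and scene points (all values are assumptions).
R = rot_y(math.radians(5.0))
t = [0.1, 0.0, -0.2]
scene = [[0.5, 0.2, 4.0], [-1.0, 0.3, 6.0], [0.0, -0.5, 3.0]]
object_views = [view_moving_object(p, R, t) for p in scene]
camera_views = [view_moving_camera(p, R, t) for p in scene]
```

The two lists of image coordinates agree up to floating point rounding, which is why the remainder of the analysis can be phrased in terms of egomotion alone.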

The  problem  addressed  in  this  thesis  can  thus  be  stated  as: 

Problem  Definition:  Given  monocular  or  binocular  spatio-temporally 
varying  images  and  known  viewing  parameters,  to  compute  egomotion 
parameters  and  structure  of  the  imaged  scene. 

The  viewing  parameters  referred  to  in  the  above  definition  are  the  focal 
length,  image  scaling  factors  and  the  relative  locations  and  orientations  of 
the  two  cameras  (in  case  of  binocular  imagery). 

Traditional  approaches  to  this  problem  have  made  two  different  kinds 
of  assumptions  when  compared  to  the  methodology  advocated  here.  The 
first  of  these  is  that  the  monocular  stimulus  should  be  enough  to  compute 
the  motion  parameters.  The  theoretical  basis  of  this  belief  will  be  explored 
to  evaluate  how  well  monocular  data  can  be  used  to  aid  the  perception 
process.  It  will  be  seen  that  the  problem  is  beset  with  two  principal 
difficulties, namely: (i) nonlinearity and (ii) high parameter space dimension.

The  above  difficulties  make  computer  algorithms  for  motion  computation 
complex  and  sensitive  to  errors  in  two  dimensional  retinal  motion 
measurement  [84]. 

Almost  all  previous  work  is  based  on  the  assumption  that  the  motion 
problem  can  be  solved  with  passive  observation.  In  passive  observation,  the 
sensors (cameras) are rigidly attached to the body in motion. Since motion
is relative, one can always assume that the sensor is fixed and the
environment  moves.  Thus  passive  navigation  deals  with  the  measurement  of 
motion  with  respect  to  a  static  sensing  system.  This  problem  assumes  that 
the  alignment  of  the  camera  axis  with  respect  to  the  three  dimensional 
velocity  directions  is  arbitrary  and  fixed.  In  general  the  solution  can  be 
shown  to  be  dependent  upon  nonlinear  equations  of  large  dimensions  [10, 
64].  There  is  no  reason  to  believe  that  passive  navigation  can  lead  to 
efficient  and  robust  solutions  to  the  problem  of  motion  perception.  In  fact  it 
will  be  shown  that  such  methods  have  inherent  ambiguities  in  so  far  as 
motion  interpretation  from  two  dimensional  retinal  cues  is  concerned. 

An alternative approach to the problem can be based upon the
assumption that the alignment of the camera axes is controllable by the
observer. In this case, as the observer continues to move in the world, the
orientations of the eyes (cameras) are continually adjusted. This adjustment
is  dependent  upon  the  two  dimensional  motion  perceived  on  the  retina,  and 
serves  -  among  other  things  which  will  be  explained  later  -  to  simplify  the 
constraints  governing  the  perception  of  the  motion.  This  is  the  mechanism 
of  active  navigation  that  will  be  explored  subsequently. 

The  goal  of  this  dissertation  is  to  explore  computational  solutions  to  the 
problem  of  rigid  motion  perception.  A  key  orientation  of  this  research  has 
been  to  derive  inspiration  for  the  structure  of  the  computer  model  from 
relevant known attributes of the primate visual system. This knowledge,
together with some of the emerging concepts in computer and cognitive
science  regarding  highly  parallel  computational  models  [31,  32]  and 
parameter  estimation  and  transformation  [7,  9,  17,  18]  motivated  the 
proposed  computational  scheme.  Before  elaborating  on  this  model  some 
aspects  of  the  biological  vision  mechanisms  are  examined. 

1.2.  The  Primate  Visual  System. 

This  section  examines  some  interesting  attributes  of  biological  vision. 
The  account  is  not  meant  to  present  a  comprehensive  picture  of  neural 
visual  processing.  Rather,  the  aim  is  to  highlight  important  neurobiological 
features  that  have  strong  computational  advantages,  and  form  an  important 
motivation  for  the  motion  model  proposed  in  this  thesis. 

1.2.1.  Abstraction  Hierarchies 

The  cortex,  which  is  the  outermost  portion  of  the  brain,  can  be  roughly 
regarded  as  a  two  dimensional  sheet,  a  few  millimeters  in  thickness.  This 
sheet consists of gray matter, which comprises the neuronal computing units, and
white matter, which constitutes the mass of fibers that the neuronal units use
to  communicate  with  each  other. 

Neuroscientists  have  been  able  to  partition  the  cortex  into  a  number  of 
distinct  areas.  The  notable  property  that  emerges  is  that  of  uniformity  of 
the  processing  architecture,  coupled  with  the  functional  diversity  of  the 
different areas [54]. The primary visual areas in the striate cortex are
retinotopic. This means that they encode information in the visual field
indexed  by  two  dimensional  retinal  coordinates.  Thus  for  example  a  bright 
spot  of  light  shone  at  a  particular  angular  position  in  the  visual  field  will 
affect  only  those  units  that  are  responsible  for  the  given  retinal  position. 

There  is  good  evidence  to  suggest  that  different  cortical  areas  compute 
and  represent  information  at  different  levels  of  abstraction  [26].  An  indication 
of  this  is  provided  by  an  experiment  by  Movshon  [50],  which  compared  the 
responses of neurons in areas V1 and MT in the macaque monkey. Given a
checkerboard stimulus, neurons in V1 responded optimally when the motion
was  perpendicular  to  the  intensity  gradients  of  the  checkers.  This  behavior  is 
isotropic  with  respect  to  the  orientation  of  the  intensity  gradient  and  only 
depends  upon  the  magnitude  of  the  temporal  intensity  change,  which  is 
maximum  when  the  motion  is  perpendicular  to  the  intensity  gradient  of  the 
checkers. On the other hand, when some of the neurons in MT were
probed,  the  responses  indicated  that  each  had  its  own  preferred  direction  of 
motion. 

An interpretation of the above could be that the V1 neurons are
involved in the computation of temporal change in image intensity, while
the MT units code optical flow, which is the retinotopic projection of the
three dimensional velocity field. Such indications of different abstraction
levels were first observed by Hubel and Wiesel [46, 47], who postulated a
hierarchical functional architecture for visual processing with successively
more and more abstract neuronal units, which they called simple, complex and
hypercomplex cells respectively.

In  the  context  of  motion  perception,  one  can  think  of  parameters  that 
are  at  higher  levels  of  abstraction.  For  instance,  in  the  case  of  rigid  body 
motion, an economical representation is provided by global (i.e. non-retinotopic)
parameters such as translation and rotation. In fact, Sakata
[74] has identified neurons in the parietal cortex that respond to full-field
rotations.

It  seems  that  there  exists  a  motion  processing  hierarchy  in  the  primate 
brain.  This  information  processing  pathway  includes  the  primary  visual 
cortex (area V1), the middle temporal visual area (MT), the medial superior
temporal visual area (MST), and the parietal cortex (area 7a) [26]. The
parietal cortex and area MST are layers in the motion hierarchy that appear
to compute high level motion features, while area MT seems to
compute lower level retinotopic (i.e. two dimensional) motion
representations.

The  foregoing  discussion  highlights  some  important  design  methodologies 
in  the  biological  hardware.  These  have  to  do  with  massive  parallelism, 
computation  in  hierarchies  and  successive  invariance  levels  characterized  by 
their  own  parameter  sets  [10].  The  lessons  drawn  from  these  attributes  will 
be  elaborated  subsequently.  However,  before  doing  that  we  will  take  a  look 
at an important control principle in the motion processing system.

1.2.2. Smooth Pursuit Eye Movements

While  it  is  true  that  not  all  creatures  with  eyes  are  able  to  move  them, 
those  that  do,  do  so  in  order  to  see  better.  One  of  the  problems  with  visual 
perception  by  a  moving  observer  is  that  the  motion  induces  blurring.  As  a 
quantitative estimate one may calculate that a target movement as slow as
1°/s, when any point on the target takes about three minutes to cross the
visual  field,  has  roughly  the  same  effect  on  resolution  as  three  diopters  of 
myopia  [93].  Thus  it  is  readily  seen  that  one  of  the  reasons  for  the  ability 
of  the  eyes  to  rotate  with  respect  to  the  head  is  to  stabilize  the  moving 
retinal  image. 

This  method  of  compensation  has  its  limitations,  however,  since  the 
eyes  cannot  displace  themselves  with  respect  to  the  head  up  to  any 
significant  degree.  Therefore,  since  the  rotational  movement  has  its  limits, 
there are two types of eye movements, both rotational. The first is called
optokinesis or smooth pursuit. This is a relatively slow and continuous
movement, whereby the image of a small target can be held steady on the
central part of the retina. This is the tracking movement we are primarily
interested in. The second type of movement is called a saccade, whereby
the eyes execute a 'catching' movement to position the image of the object
on the central part of the retina. The velocity with which this movement is
executed is quite large, being of the order of 1000°/s [72].

For a human, the pattern of eye movements is likely to be a sequence
of  saccades  with  smooth  pursuit  movement  in  between.  This  pattern  is 
called  optokinetic  nystagmus.  Consider  for  instance,  a  jogger  running  at  a 
relatively  steady  pace.  His  eyes  are  continually  moving  according  to  the 
following  steady  pattern  ([22]): 

(1)  A  target  feature  is  selected  in  the  environment. 

(2)  A  saccade  is  made  to  catch  the  target  and  align  its  image  with  the 
optical  axis. 

(3) A smooth pursuit movement of the eyes takes place, where the target
is tracked and held steady (i.e. the retinal slip is kept as near to null as
possible).

(4) When the rotational displacement of the eyes reaches some limit,
another suitable target is selected at the periphery of the visual field
and steps (2) to (4) are repeated.
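
The cycle in steps (1) to (4) can be sketched as a small simulation. This is only an illustration under stated assumptions: a single rotational degree of freedom, scene features that drift across the visual field at a constant angular velocity, perfect pursuit (zero retinal slip), instantaneous saccades, and a hypothetical eye rotation limit. None of the names or numerical values come from the thesis.

```python
def nystagmus(target_speed, eye_limit, new_target_angle, duration, dt=0.01):
    """Simulate alternating smooth pursuit and saccades in one dimension.

    target_speed     : angular velocity of scene features (deg/s), assumed constant
    eye_limit        : assumed maximum eye rotation (deg) before a new target is chosen
    new_target_angle : angular position (deg) at which a fresh target is selected
    Returns the list of eye angles and the number of saccades made.
    """
    target = new_target_angle   # step (1): a target feature is selected
    eye = target                # step (2): a saccade aligns the optical axis with it
    saccades = 1
    trace = []
    for _ in range(int(round(duration / dt))):
        # Step (3): smooth pursuit; the eye follows the target so slip stays zero.
        target += target_speed * dt
        eye = target
        # Step (4): at the rotation limit a new peripheral target is selected
        # and a saccade (treated as instantaneous here) catches it.
        if abs(eye) >= eye_limit:
            target = new_target_angle
            eye = target
            saccades += 1
        trace.append(eye)
    return trace, saccades

# Hypothetical values: features drift at 10 deg/s, the eye may rotate 20 deg.
trace, saccades = nystagmus(target_speed=10.0, eye_limit=20.0,
                            new_target_angle=-20.0, duration=10.0)
```

Plotting `trace` against time would show the sawtooth pattern characteristic of optokinetic nystagmus: slow pursuit phases separated by fast resetting movements.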

The above behavior pattern seems to support the claim put forth by
Cutting that the pursuit system plays a cardinal role in our ability to
navigate  in  a  cluttered  environment. 

The  most  interesting  aspect  of  the  smooth  pursuit  or  tracking  system  is 
that  it  illustrates  an  active  principle  in  human  visual  processing.  In  other 
words, the system is closed loop. This point is worthy of reiteration, since it
is eminently sensible, even from a system theoretic point of view, to
design measurement mechanisms that adapt to the changes in stimulus. It
will  be  shown  subsequently  that  there  are  good  reasons  for  designing 
computer  models  for  motion  perception  in  a  similar  manner. 

Experimental  evidence  indicates  that  the  primate  pursuit  mechanism 
works best when the retinal target velocity is not more than 30°/s [73]. Two
other quantitative performance parameters of this mechanism that are
relevant are the information processing latency within the control loop,
which is around 100 ms, and the tracking error, which is found to be well
within 10 percent of the target velocity.

In  summary,  it  should  be  mentioned  that  an  active  method  forms  a 
dominant  principle  in  the  motion  perception  scheme  in  primates,  and 
furthermore: 

(i)  The  pursuit  hardware  is  an  integral  part  of  the  motion  processing 
pathways.  (Recall  Cutting’s  observations  on  how  tracking  facilitates 
navigation). 

(ii)  The  selection  of  the  target  to  be  tracked  depends  on  image  features 
such as luminance, size of target (the smaller the better), position in the
visual  field  (smaller  eccentricities  preferred)  and  velocity.  Although 
small  punctate  targets  are  preferred,  humans  can  take  advantage  of 
aggregate  motion  to  pursue  targets  that  are  perceivable  but  not  visible. 
An  example  is  the  ability  to  track  the  center  of  a  rolling  wheel  that  is 
marked only by several small lamps attached to its rim [77].

(iii) The 100 ms loop latency seems to be divided into two time steps. The
first  step,  of  about  40  ms  duration,  is  distinguished  by  the  fact  that  the 
system  seems  to  consider  only  the  direction  of  target  motion  for  its 
computations  but  not  the  speed. 

1.3. Computer Vision, Connectionism, Biology and Computational
Structure 

The  study  of  machine  vision  systems  cannot  ignore  the  fact  that  most 
of  the  tasks  that  one  sets  out  to  solve  are  modeled  after  biological  vision. 
Studies  of  the  human  information  processing  system  are  affecting  the  design 
of  machine  models  for  similar  tasks  in  most  radical  ways,  as  computer  and 
cognitive  scientists  are  increasingly  becoming  aware  of  the  fact  that 
conventional  stored  program  concepts  inhibit  the  formulation  of  cognitive 
tasks  in  a  fast,  robust,  adaptive,  fault  tolerant  manner  [31]. 

The  foregoing  sketch  of  some  aspects  of  the  primate  visual  system  has 
served  to  provide  a  rationale  for  us  to  inquire  whether  it  is  a  good  idea  to 
incorporate  concepts  from  Nature  into  computer  models.  The  specific  task 
at  hand  is  motion  interpretation.  This  section  will  discuss  the  design 
decisions  that  were  made  regarding  the  structure  of  the  proposed  computer 
model  for  motion  perception. 

The study of computer vision is conventionally divided into three levels:

(i) Low Level Vision: This stage is concerned with early processing of visual
information.  This  level  is  characterized  by  the  local  nature  of  the 
computations performed. In the vision mechanism of primates, this
stage encompasses the visual processing that begins at the retina and continues up to
the primary visual cortex. In computer vision, examples are provided
by  computations  connected  with  the  formation  of  the  primal  sketch 
representation  of  Marr  [57]  or  Feldman’s  retinotopic  frame  [32]. 
Operations  at  this  level  are  exemplified  by  filtering,  convolution  and 
relaxation  based  on  local  constraints. 

(ii)  Intermediate  Level  Vision:  This  level  of  processing  is  characterized  by 
two  major  endeavors,  namely  segmentation  and  the  computations  of 
parameters  that  signal  regional  invariance  characteristics.  Here  the  main 
task  is  to  compute  stimulus  representations  that  will  be  used  in  the  next 
level  of  visual  tasks.  One  characteristic  of  encodings  at  this  level  is  that 
they  are  intrinsic  properties  of  the  viewed  objects,  and  are  independent 
of particular viewing conditions. Examples of representations at this
level are optical flow and the field of surface normals corresponding to visible
surfaces [15]. This level is also exemplified by Marr’s 2½-D sketch and
Feldman’s stable feature frame. Segmentation is an operation that is
quite crucial at this level, because of the need to separate out image
regions corresponding to different objects or moving surfaces. This
separation is, invariably, a difficult task but serves to simplify higher
levels of computation by preventing effects of independent phenomena
from interfering with each other in the analysis (this latter problem is
sometimes referred to as crosstalk in connectionist literature).
Segmentation  and  parameter  estimation  at  this  stage  often  involve 
interacting goals. Cooperative computational algorithms
and constraints derived from higher level computational layers are
likely to facilitate the processing at this stage.

(iii) High Level Vision: This is the level of symbolic information processing.
The central task at this level is concerned with what has been called
the indexing problem [30]. This is the problem of deriving the
description  of  a  situation  from  a  set  of  visual  features  computed  at  the 
lower  levels.  At  this  level  methods  for  knowledge  representation, 
storage  and  retrieval  are  of  crucial  importance.  An  example  of  a  model 
of  computations  can  be  found  in  [32],  where  a  dynamically  modified 
store  of  objects  and  relations  in  the  observer’s  extrapersonal  space 
called  an  environmental  frame,  interacts  with  a  more  permanent 
repository  of  world  knowledge  called  the  world  knowledge  formulary. 

The  task  addressed  by  this  thesis  spans  the  first  two  levels  of  the  above 
hierarchy.  The  computation  of  motion  parameters  begins  from  a  time 
varying  sequence  of  images.  One  can  imagine  this  input  to  be  akin  to  a 
number  of  consecutive  frames  of  a  movie  or  video  sequence.  Clearly,  there 
is a difference between the natural visual input to our eyes and the spatio-temporally
sampled stimuli that the machine vision system will have to work
with.  However,  it  will  be  argued  in  chapter  two  that  in  general  this 
difference  will  not  substantially  alter  our  approach  to  the  problem  at  hand. 

The  first  and  most  basic  question  that  will  have  to  be  asked  is  what 
form  the  computation  is  going  to  take.  There  is  a  choice  here  since  one 
could  conceivably  attempt  direct  computation  of  the  motion  parameters 
from the intensity function and its spatio-temporal derivatives [2, 64]. This
method has been proposed recently in restricted cases, like motion of planes
or pure rotational motion. A generally applicable strategy according to this
approach seems difficult and is yet to emerge. The alternative
approach,  which  is  adopted  here,  is  modeled  after  the  abstraction  hierarchy 
idea  encountered  previously.  The  two  schemes  are  shown  in  figure  1.1. 

As  mentioned  before,  the  chosen  avenue  for  the  investigations  is 
motivated  by  the  connectionist  paradigm  and  the  associated  notion  that  the 
computation  is  structured  so  as  to  compute  successive  invariant  levels 
characterized  by  a  small  set  of  parameters,  as  in  the  parameter  net  formalism 
of  Ballard  [9].  Such  a  methodological  orientation  dictates  that  the  constraint 
relations  between  the  parameters  in  adjacent  layers  be  kept  as  simple  as 
possible.  In  addition,  it  is  desirable  to  minimize,  as  far  as  possible,  the  size 
of the parameter sets describing the invariants at each layer. (Later, this
cardinality is referred to as the dimensionality of the parameter space
corresponding to a particular invariance layer, a usage whose purpose will
become clear when we introduce the notion of the Hough transform as a
general computational paradigm for parallel algorithms.) The reason is that
when we envisage a highly parallel implementation of a computational
scheme in the connectionist form, proliferation of dimensions and
complexity of constraints cause exponential growth in units and connections.

Figure 1.1 Alternative models for Motion Analysis
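
The exponential growth in units with parameter space dimension is easy to quantify for an accumulator-style representation. The sketch below assumes, purely for illustration, that every parameter is quantized into the same number of bins and that one unit is allotted to each cell of the parameter space. The bin count and the way the parameters are split between layers are hypothetical and are not the decomposition used in the thesis.

```python
def units_required(bins_per_parameter, dimensions):
    """Number of units if every cell of the quantized parameter space gets one unit."""
    return bins_per_parameter ** dimensions

def layered_units(bins_per_parameter, layer_dimensions):
    """Total units when the computation is split into layers of lower dimension."""
    return sum(units_required(bins_per_parameter, d) for d in layer_dimensions)

bins = 32                              # assumed quantization per parameter
joint = units_required(bins, 5)        # five parameters handled in one space
split = layered_units(bins, [2, 3])    # the same five parameters in two layers
```

With these assumed numbers the joint space needs 33,554,432 units while the two smaller spaces need 33,792 in total, which illustrates why small parameter sets per layer are desirable.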

Now  it  is  possible  to  answer  the  question  as  to  whether  all  the  layers  in 
the  proposed  computer  model  are  really  necessary.  Notice  that  in  the  direct 
computational  model  one  is  constrained  to  handle  the  motion  and  structure 
parameters  together,  necessitating  higher  dimensional  parameter  spaces  and 
complex constraint relations for the parameter computation. On the other
hand, the layered model can be shown to deal with smaller parameter spaces
and simpler constraints linking them (see chapter three). Demonstrating
this was, in fact, a principal goal of this thesis.

The  next  section  will  paraphrase  the  subject  matter  of  each  of  the 
chapters  of  this  dissertation  and  indicate  the  contributions  made.  Some  idea 
of  the  nature  of  the  various  layers,  with  reference  to  figure  1.1  will  also  be 
given. 

1.4.  Outline  of  the  Dissertation 

The  structure  of  the  proposed  computer  model  for  motion  perception  is 
given  in  figure  1.1.  Chapter  two  deals  with  the  computation  of  image 
motion.  The  computation  is  basically  achieved  in  two  stages,  involving  the 
computation  of  image  features  using  local  image  filtering,  followed  by  a 
cluster  based  matching  and  segmentation  algorithm  for  estimating  the  image 
motion.  The  next  chapter  looks  at  the  constraints  governing  the 
computation  of  the  rigid  motion  parameters  and  structure.  Investigation 
centers  around  determination  of  the  nature  of  the  computation,  ways  of 
segmenting  the  structure  and  motion  parameter  computation,  and  ways  of 
resolving  interpretation  ambiguities  when  they  arise.  Chapter  four  details 
algorithms for motion perception from computed image motion. The
constraints used are derived from the analysis of chapter three. The
overall computational paradigm employed for the proposed algorithm is
the Hough transform, which is shown to be parallelizable and
conceptually simple. Chapter five examines some of the difficulties faced by
the final stage of the computation. An active tracking mechanism is shown
to lead to considerable simplification of the computational requirement.
Finally, the last chapter concludes by reiterating the goals of the research and
the results of the study.

The following subsections discuss some of the key ideas developed in
the chapters that follow.

1.4.1.  Image  Motion  Measurement 

The  goal  here  is  to  examine  models  for  image  motion,  and  determine 
how  they  can  be  computed.  Intuitively,  as  well  as  from  psychophysical 
evidence  [36],  it  is  seen  that  two  dimensional  velocity  or  optical  flow  is  an 
adequate  and  useful  representation  for  image  motion.  Optical  flow  captures 
the  motion  and  structure  information  in  the  retinal  image  flux  and  is  thus 
an  abstraction  useful  for  theoretical  analysis.  Schemes  for  optical  flow 
measurement  proposed  in  the  literature  are  applicable  only  under  restrictive 
circumstances. A common difficulty occurs in image regions
where contours are present. In this situation, only the component of the
optical flow normal to the local contour orientation can be measured (Marr
and Ullman [56] call this the aperture problem).

The retinal motion representation used in this research is a discrete
form of optical flow. The measurement method is based on matching
“interest points” obtained by convolving the image frames with a set of
feature masks and applying a simple decision rule for selecting or rejecting
particular retinal locations. The relation between the discrete and the
continuous representations (e.g. optical flow) is analogous to that between
the chord and the tangent to a continuous curve at any location. The curve
referred to is the interpolated trajectory of a retinally projected world point.

A problem with retinal motion measurement has been the difficulty
faced by researchers in segmenting motion fields generated by more than
one moving object [33]. The clustering approach adopted in our proposed
model provides a uniform scheme for dealing with both the matching and
the segmentation problems. Local interaction between motion vectors is
modeled by a similarity function like the one used in [69]. This approach
is more flexible than using the mathematical notion of smoothness
to constrain the motion field [44, 87], since the latter requires a dense
sampling of the retinal space in order to estimate derivatives of the motion
field. The method proposed here has been tested on artificial as well as real
data.

1.4.2. Constraints for Motion Analysis

No  serious  study  of  rigid  motion  perception  can  be  successful  without 
an  understanding  of  the  geometrical  relationships  that  make  it  possible  to 
compute  three  dimensional  motion  from  retinally  projected  velocities  (or 
displacements). The geometric analysis should be aimed at answering
questions such as what is computable and how simple the required
computational steps are. The representations of the various entities to be
computed at each stage of the computation must be chosen with care, to
ensure that they can be computed conveniently and that there is no
unnecessary redundancy.

In  this  respect,  the  choice  of  parameters  to  represent  the  motion  of  a 
rigid  body  has  to  be  made.  A  simple  solution  is  to  represent  the  motion  by 
the  set  of  three  dimensional  velocity  vectors  corresponding  to  each 
observable point on the surface of the body. This is a redundant
representation because a rigid body that is free to move in space has six
degrees of freedom; six parameters should therefore be enough to describe
its motion.

There can be many alternative forms of the six parameters, which are
equivalent but not numerically identical to each other. Examples of such
representations are

(i) Translational and rotational velocity components of the body.

(ii) The instantaneous axis of rotation and the rotational velocity of the
body.
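
To make the redundancy argument concrete, the following minimal sketch uses representation (i): three translational and three rotational velocity components generate the three dimensional velocity of every point on the body. The relation V = T + omega x P, the sign convention (the body moves relative to a fixed origin), the function names and the numerical values are all assumptions made for illustration and are not taken from the analysis of chapter three.

```python
# Illustrative sketch only: names and values are hypothetical.

def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def point_velocity(translation, rotation, point):
    """Velocity of a point on a rigid body: V = T + omega x P."""
    spin = cross(rotation, point)
    return tuple(t + s for t, s in zip(translation, spin))

# Six numbers describe the motion of the whole body ...
T = (1.0, 0.0, 0.0)      # translational velocity
omega = (0.0, 0.0, 2.0)  # rotational velocity (about the z axis)

# ... and determine the velocity at any number of surface points.
points = [(1.0, 0.0, 5.0), (0.0, 1.0, 5.0), (0.0, 0.0, 7.0)]
velocities = [point_velocity(T, omega, p) for p in points]
for p, v in zip(points, velocities):
    print(p, "->", v)
```

However many surface points are observed, their velocities are functions of the same six numbers, which is why the per-point velocity field is a redundant description.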

The representation that is chosen may depend upon the ease of
computation and manipulation in the particular application domain. In most
of the geometrical analysis given subsequently, the velocity representation
for motion is used. The reason is that this differential approach leads to
simplicity in the algebraic relations used in the analysis without diluting the
concepts underlying the mathematical characterization of the problem
domain.

The  geometrical  transformation  in  the  eye,  or  camera,  giving  rise  to  the 
two  dimensional  image  from  three  dimensional  scenes  is  called  perspective  or 
polar  projection  (refer  to  chapter  three).  Another  model  of  transformation 
is  the  orthographic  projection,  which  is  an  approximation  of  polar  projection. 
The constraint equations obtained from the differential analysis embody a
“small” motion approximation. An understanding of the small
displacement approximation is essential in order to determine under what
conditions the constraint equations are valid and what errors are
introduced by the quantization process that approximates differentials by
differences. These issues are examined later.
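
The effect of approximating differentials by differences can be illustrated numerically. The sketch below assumes perspective projection with unit focal length (x = X/Z, y = Y/Z) and a point moving with constant three dimensional velocity; it compares the instantaneous image velocity with the displacement per unit time measured over a finite interval. The function names and the numerical values are hypothetical.

```python
import math

def project(P, focal=1.0):
    """Perspective (polar) projection of a 3-D point onto the image plane."""
    X, Y, Z = P
    return (focal * X / Z, focal * Y / Z)

def image_velocity(P, V, focal=1.0):
    """Instantaneous image velocity: time derivative of project(P)."""
    x, y = project(P, focal)
    Z = P[2]
    return ((focal * V[0] - x * V[2]) / Z,
            (focal * V[1] - y * V[2]) / Z)

def image_displacement_rate(P, V, dt, focal=1.0):
    """Finite difference: image displacement over dt, divided by dt."""
    P2 = tuple(p + v * dt for p, v in zip(P, V))
    x1, y1 = project(P, focal)
    x2, y2 = project(P2, focal)
    return ((x2 - x1) / dt, (y2 - y1) / dt)

P = (1.0, 2.0, 10.0)   # world point (assumed values)
V = (0.5, 0.0, -1.0)   # its 3-D velocity, approaching the camera

flow = image_velocity(P, V)
errors = []
for dt in (1.0, 0.1, 0.01):
    rate = image_displacement_rate(P, V, dt)
    errors.append(math.hypot(rate[0] - flow[0], rate[1] - flow[1]))
    print(f"dt={dt}: difference from instantaneous flow = {errors[-1]:.2e}")
```

The discrepancy shrinks as the interval shrinks, which is the chord and tangent relationship between displacements and velocities in numerical form.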

It is known [28, 84, 89] that a single monocular observation of the
optical flow field may not be enough to determine the three dimensional
motion parameters uniquely. This ambiguity is seen, for example, in the
motion of a planar surface. Some of the algorithms that have been proposed
for recovering motion parameters from discrete retinal displacements have
been analyzed to ascertain the conditions under which the computation leads
to unique results. However, there has been no examination of the
uniqueness question that is independent of any particular algorithm.

An analysis of the constraints which form the basis of any approach to
motion perception leads to the following results:

(1) The motion ambiguity for planar surfaces can be resolved when the
orientation of the plane is known, even partially, meaning that the tilt
angle but not the slant is available.

(2) In general there can be at most three interpretations of the optical flow
field. Hence any local analysis, e.g. involving spatio-temporal
derivatives of flow, must involve nonlinear equations (at least cubic),
in the absence of shape information.

(3) If the three dimensional velocity of the rigid body under observation
varies smoothly, then observation of the flow field at two or more time
instants can determine the motion uniquely.

(4) Local shape information (surface orientation) is a powerful aid to
motion perception.

The analytical results outlined above lead to an understanding of the
theoretical basis of any motion perception algorithm. They also highlight the
fact that the task is difficult due to the inherent nonlinearity and the large
size of the parameter set.

1.4.3.  Algorithms  for  Motion  Perception 

Chapter  four  deals  with  algorithms  for  rigid  motion  perception,  based 
on  the  analysis  in  the  previous  chapter.  Some  of  the  basic  principles  are 
demonstrated  by  computer  simulations  using  synthetically  generated  data. 

The computational principle underlying the design of the algorithms is
the Hough transform [7, 9, 18]. The idea derives from histogramming in
parameter space. Instances of the constraint hypersurfaces “vote” for
the parameter values that are compatible with them. The parameter value
estimated to be the most likely candidate, compatible with the global set of
constraints, is the one receiving the largest number of votes. This vote
counting can be implemented in parallel by a connectionist network.
However, as mentioned before, the number of units and connections grows
exponentially with the dimension of the parameter space.
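
The voting idea can be sketched in a few lines. The example below is a toy instance and not one of the algorithms of chapter four: it assumes a two dimensional parameter space quantized to integer values and linear constraints of the form a*p + b*q = c, each of which votes for every grid cell lying on it. The constraint values, including the deliberately inconsistent last one, are invented for illustration.

```python
from collections import Counter

def hough_vote(constraints, values):
    """Accumulate votes over a quantized 2-D parameter grid.

    Each constraint (a, b, c) stands for the line a*p + b*q = c in
    parameter space and votes for every grid cell (p, q) lying on it.
    """
    votes = Counter()
    for a, b, c in constraints:
        for p in values:
            for q in values:
                if a * p + b * q == c:
                    votes[(p, q)] += 1
    return votes

values = range(-5, 6)  # quantized values for each parameter

# Five constraints consistent with (p, q) = (2, -1), plus one outlier.
constraints = [(1, 0, 2), (0, 1, -1), (1, 1, 1),
               (1, -1, 3), (2, 1, 3), (1, 1, 4)]

votes = hough_vote(constraints, values)
best_cell, best_count = votes.most_common(1)[0]
print(best_cell, best_count)
```

The accumulator here has len(values)**2 cells; with d parameters it would have len(values)**d, which is the exponential growth in units referred to above. The outlier constraint spreads its votes over cells that receive little other support, so it does not disturb the peak.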

The Hough paradigm has been explored in the domain of motion
parameter estimation. Some of the limitations of the approach, brought on
by the nonlinearity of the constraint equation, are examined and heuristics
are suggested to overcome them.

1.4.4. Active Tracking Constraints

When  a  mobile  system  has  the  ability  to  visually  track  points  in  its 
environment,  it  can  be  shown  that  the  mathematical  relations  that  govern 
the  determination  of  the  motion  parameters  become  considerably  simpler. 
One  might  suppose  that  the  demands  of  a  tracking  system  might  overwhelm 
its  advantages.  In  other  words,  could  the  requirements  of  such  a  system  be 
more  difficult  to  achieve  than  the  original  problem  of  static  motion 
measurement?  It  will  be  argued  that  this  is  not  the  case,  since  the  tracking 
of  the  image  of  an  environmental  point  is  well  within  the  reach  of  current 
technology,  once  that  point  has  been  identified. 

The mathematical advantages of tracking. As we have seen, there are
powerful advantages to designing a motion interpretation system based on
tracking. The arguments in the foregoing sections have been mostly
confined to the retinal structure in the flow field. An important point about
the tracking regime is that it needs only retinal motion measurements to
sustain it. One can expect the matching of eye motion to the retinally
projected motion of the imaged scene to facilitate the three dimensional
motion measurements. This is indeed the case; in fact, it will be seen that
the following hold true:

(1) In the monocular case the number of parameters in the motion
constraint equation reduces by one, without any increase in the degree
of the nonlinearity.

(2) When the tracking is done by a system of two cameras whose relative
positions and orientations are known, the constraint equations
reduce considerably in dimension, in addition to being linear in the
parameters. In this case observation of the optical flow at just two
points is enough to determine the motion parameters completely.

(3) For binocular viewing, it is necessary to combine the optical flow fields
from the two eyes. However, this is not necessary when the
observation period extends to more than just one instant of time. In
this case one can obtain closed form solutions for the rigid motion
parameters without the necessity for binocular fusion.

1.4.5.  Summary 

This research is concerned with the computation of rigid motion
parameters from a spatio-temporally varying retinal stimulus. The problem is
approached in three stages. These relate to the mathematical and geometrical
relationships that exist between the three dimensional parameters and their
retinal counterparts, the representations that can be computed at various levels
of the computational process to facilitate the perception process, and the
structure of the computational processes themselves.

Chapter Two


Computation  of  Image  Motion 


2.1.  Introduction 

In keeping with the hierarchical model for the interpretation of visual
motion, the first task that is investigated concerns the measurement of the
image motion stimuli. To talk meaningfully about this measurement
process it will be necessary to define the input and output quantities and
various intermediate representations. The objective is to study the problem
from the point of view of machine vision. However, in many cases the
approach adopted is based on what are believed to be certain principal
aspects of biological vision.

Mathematically, the input is a three dimensional (spatio-temporally
varying) intensity function. The spatial coordinates $(x,y)$ of this function
$f(x,y,t)$ refer to the cartesian indexing of the retina or image plane. In
reality however, as far as the human eye is concerned, the available input is
a spatially sampled and temporally averaged version $\hat{f}(x,y,t)$ of the
underlying function $f(x,y,t)$. Similarly, for the machine vision case, the
input is again a spatio-temporally sampled version of the “real” image.

The temporal discreteness of the latter imaging situation has led visual
psychologists to designate this type of visual input as apparent motion stimuli.
The fundamental distinction between “real” and apparent motion is
stimulus continuity. However, this distinction will not affect the proposed
computational algorithms, provided that the spatio-temporal
variations (frequencies) in the underlying “real” image distribution $f(x,y,t)$
are not lost in the sampling process and there is no aliasing. This will be
called the adequate sampling assumption.
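
A small numerical illustration shows what goes wrong when the adequate sampling assumption fails. The sketch assumes a one dimensional sinusoidal pattern that shifts rigidly between two frames; any shift that differs from the true one by a whole wavelength yields identical samples, so only the shift wrapped into half a wavelength either way is recoverable. The function names and values are hypothetical.

```python
import math

def frame(shift, wavelength, width=8):
    """Samples of a sinusoidal pattern displaced by `shift` pixels."""
    return [math.sin(2 * math.pi * (x - shift) / wavelength)
            for x in range(width)]

def apparent_shift(true_shift, wavelength):
    """Smallest-magnitude shift that produces the same sampled frame."""
    half = wavelength / 2.0
    return (true_shift + half) % wavelength - half

wavelength = 4
slow = apparent_shift(1, wavelength)  # adequately sampled motion
fast = apparent_shift(3, wavelength)  # moves more than half a wavelength

# A shift of +3 pixels is indistinguishable from a shift of -1 pixel.
same = all(abs(a - b) < 1e-9
           for a, b in zip(frame(3, wavelength), frame(-1, wavelength)))
print(slow, fast, same)
```

In the second case the pattern appears to move backwards by one pixel, the familiar wagon-wheel effect; motion of this kind is excluded by the assumption.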

The second issue has to do with the determination of what to compute.
In other words, what is an adequate explicit representation for image
motion? An answer is provided by the notion of optical flow, a concept
attributed to J.J. Gibson [36]. The optical flow field can be thought of as
the retinal projection of the three dimensional velocity field, which in turn
can be thought of as the representation that describes the motion of rigid
objects and surfaces. Of course, as will be seen in chapter three, for rigid
bodies there is a much more parsimonious description of the motion than the
three dimensional velocity field. Nonetheless, it will be assumed that the
optical flow representation will serve as an adequate representation for image
motion [63].

Optical flow is an idealized notion, and to measure it requires a
continuous motion stimulus, which neither biological nor machine
vision systems have available to them. But, as mentioned before, under the
adequate sampling assumption the sampling process does
not entail any loss of information. So it will be claimed that optical flow
can, in principle, be recovered. There are essentially two alternative stages
of processing where the transition from discrete to continuous may be
made. Correspondingly, there are two distinctive styles or classes of image
motion measurement algorithms:

I. Continuous Techniques: The sampled image function $\hat{f}(x,y,t)$ is
interpolated at the onset of the measurement process, to obtain the
“real” image $f(x,y,t)$. The subsequent processing can then be based
on continuous rather than discrete transformations and operations.

II.  Discrete  Techniques:  The  discrete  point  to  point  displacements  are 
computed  over  the  quantized  space  time  dimensions.  The  perceptual 
system  then  performs  smooth  interpolations  over  a  small  “integration” 
time  period  when  the  spatio  temporal  trajectories  of  the  observed 
“tokens”  are  smoothed  to  obtain  a  sparse  sampling  of  the  optical  flow 
field  at  the  retinal  locations  where  the  match  tokens  were  found. 

The retinal motion measurement algorithms proposed so far belong to
one or the other of the above classes [61]. Unfortunately, all such algorithms,
be they continuous or discrete, suffer from ambiguity or local indeterminacy.
For instance, in the discrete case there may locally be more than one
candidate token to match a particular token. Marr and Ullman call
this the aperture problem [56]. The problem arises due to the local nature
of the measurement algorithm and hence a limited field of “attention” or
aperture. Thus if the aperture permits the viewing of only a part of some
smooth contour (e.g. a straight line), then because there is no
distinguishable token on the contour, the motion of points on the contour
can be constrained (figure 2.1) but not determined exactly.

Thus if the instantaneous optical flow field is denoted by $r = \{u(x,y), v(x,y)\}$,
and the constraint available in the local aperture is $C(f(x,y,t), u, v) = 0$, then
at every sampled image location $(x,y)$ there are two unknowns to determine
(i.e. the values that the functions $u$ and $v$ take), but only one equation.

Figure 2.1 The Aperture Problem

To overcome the local ambiguity additional constraints are necessary.
The main assumption used by other researchers is that the velocity field is
globally smooth. In terms of the variational calculus, one assumes that the
constraint equation is not exactly satisfied, but $C(\cdot) = \epsilon$, where $\epsilon$ is the
amount of error. Now if the smoothness criterion is given by the functional
form $S(u,v) = 0$, then the measurement problem can be posed as minimizing
$E = C(\cdot) + \lambda S(\cdot)$ with respect to $u$ and $v$, where $\lambda$ is a Lagrange multiplier.
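
The situation of one equation in two unknowns can be made concrete with a commonly used choice of local constraint, which is assumed here purely for illustration since the form of the constraint is left general above: the brightness constancy relation Ix*u + Iy*v + It = 0, where Ix, Iy and It are the spatial and temporal derivatives of the image intensity. The derivative values and function names below are hypothetical.

```python
def constraint(Ix, Iy, It, u, v):
    """Residual of the assumed local constraint Ix*u + Iy*v + It = 0."""
    return Ix * u + Iy * v + It

def normal_flow(Ix, Iy, It):
    """The one component the constraint fixes: flow along the gradient."""
    g2 = Ix * Ix + Iy * Iy
    return (-It * Ix / g2, -It * Iy / g2)

Ix, Iy, It = 3.0, 4.0, -10.0   # assumed derivative measurements
un, vn = normal_flow(Ix, Iy, It)

# Adding any multiple k of the contour direction (-Iy, Ix) leaves the
# constraint satisfied, so the tangential component is undetermined.
residuals = []
for k in (0.0, 0.5, -2.0):
    u, v = un - k * Iy, vn + k * Ix
    residuals.append(constraint(Ix, Iy, It, u, v))
    print(f"k={k}: (u, v)=({u:.2f}, {v:.2f}), residual={residuals[-1]:.1e}")
```

Every candidate velocity on this line in (u, v) space is equally compatible with the local measurement, which is why an additional term such as the smoothness functional is needed to select one of them.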

The above formulation has been the hallmark of a number of
continuous optical flow measurement algorithms [40, 44, 87]. One of the
primary problems with this class of spatio-temporal gradient based
methods is that the convergence criteria and rate for the relaxation process
are not known. This lack of performance bounds limits the applicability of
such methods.

The second type of continuous formulation seeks to eliminate the
inherently iterative/search nature of gradient based algorithms. There are
two ways in which this has been attempted. One is to implement digital
filters that are sensitive to time varying intensity patterns, and which
purportedly mimic the spatio-temporal receptive fields of the biological
system (an example can be found in [34]). However, the problem here is
that it is hard to determine exactly what such filters indicate quantitatively,
although qualitatively the outputs of such filters may prove to be useful for
discrimination and segmentation purposes. The other type of motion
computation involves assumptions regarding the local structure of the moving
surfaces. For instance, assuming that the visible surfaces in motion are
locally planar leads to locally second order flow fields, which may be
easier to measure [1, 90].

Surprisingly, the class of discrete algorithms has not been explored
with the same vigor applied to the study of the continuous methods. The
paradigm of token matching is the dominant strain for such methods.
In contrast to continuous methods, the main operating criteria here are:

(i) The motion measured be due to the geometrical projection of moving
features in three dimensions.

(ii)  The  measurement  be  immune  to  variations  in  lighting  and  viewpoint. 

(iii)  No  elaborate  form  analysis  precede  the  actual  measurement  operation. 

The usual approach here [14, 85] is to assign different a priori
probabilities or confidences to the competing match vectors and to choose the
best set of non-conflicting matches based on some global compatibility
measure. Of these two methods, the first [14] employs a solution method
that is ad hoc. The notion of similarity is never exactly quantified. The claim
is that the confidence update rule captures the notion of local similarity;
however, the method never makes clear the exact nature of this
interaction. The rate of convergence of the algorithm is also not shown.

On the other hand, Ullman’s exposition [85], which also concerns
discrete matching, seeks to develop a mathematical theory of visual motion
computation. While this minimal mapping theory is in itself an exemplary
work in the field of scientific exegesis, it makes certain strong assumptions
and leaves certain questions unanswered. The main idea is simple and elegant,
and proposes the choice of matches to minimize the entropy of the global
field of matches. Each velocity $v$ is associated with an entropy measure
$q(v) = -\log p(v)$, where $p(v)$ is the probability that $v$ is the true velocity. Thus
the idea is to treat $q(v)$ as the cost of assuming velocity $v$, so that the total
cost can be minimized globally. In other words, he is looking for the maximum
likelihood solution.

The  problems  with  this  approach  mainly  stem  from  the  fact  that  to 
translate  the  above  idea  into  a  working  mechanism,  one  has  to  make  some 
simplifications.  The  simplifications  that  Ullman  proposes  require  the 
assumptions  that  the  probability  distributions  of  retinal  velocities  that  are 
obtained  at  different  image  locations,  are  independent  and  that  the 
probabilities  are  inversely  proportional  to  the  velocity  magnitudes.  It  is  very 
hard  to  justify  the  first  assumption,  while  the  second  is  a  very  coarse 
approximation  which  is  not  entirely  justified  by  empirical  data  presented  in 
[85]. The remaining objection to the method is its computational
complexity. Although the algorithm is formulated as a linear programming
problem, the gradient method proposed for its solution need not converge
to the desired solution, and in fact could lead to cyclic behavior under
certain extreme conditions. Hence it is difficult to envisage a network of
parallel computing elements implementing this algorithm within the limits
of the performance constraints imposed by “realistic” (or biological)
computational devices.
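
The cost-minimization idea discussed above can be sketched for a handful of tokens. This is not the linear programming formulation of [85]: it assumes a one-to-one matching, takes the match probability to be inversely proportional to displacement magnitude (so that the cost is the logarithm of the displacement, up to a constant), treats the matches as independent, and searches all permutations by brute force. The names, the small offset `eps` and the token coordinates are hypothetical.

```python
import math
from itertools import permutations

def cost(a, b, eps=1e-6):
    """q = -log p, with p assumed proportional to 1/|displacement|."""
    d = math.hypot(b[0] - a[0], b[1] - a[1])
    return math.log(d + eps)  # eps keeps the log finite at d = 0

def minimal_mapping(frame1, frame2):
    """One-to-one matching that minimizes the summed cost.

    Brute force over all permutations, so only usable for a few tokens.
    Returns a list m where frame1[i] is matched to frame2[m[i]].
    """
    n = len(frame1)
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost(frame1[i], frame2[j])
                                    for i, j in enumerate(perm)))
    return list(best)

# Three tokens, each displaced by one pixel to the right; the second
# frame lists them in a different order.
frame1 = [(0, 0), (10, 0), (0, 10)]
frame2 = [(11, 0), (1, 10), (1, 0)]

matching = minimal_mapping(frame1, frame2)
print(matching)
```

Summing the costs is the same as multiplying the probabilities only because the matches are taken to be independent, which is the first of the assumptions questioned above. The factorial search also hints at the complexity objection.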

The discrete algorithms described so far we will classify as
discrete/iterative, since, in their proposed forms, they involve local search
with ill understood rates of convergence.

In  contrast  we  believe  that  a  computational  theory  of  image  motion 
should  try  to  satisfy  some  desirable  properties  not  encompassed  by  the 
above  motion  measurement  algorithms.  For  instance: 

(i) The velocity estimates for a single region should be collected together,
while those for different regions should be separated out. Thus not only
should our theory show how to handle local similarity but it should also
recognize boundaries where dissimilarities occur.

(ii)  The  formulation  of  the  algorithm  should  also  bring  out  the  complexity 
of  the  computations,  and  justify  the  simplifications  introduced  to  reduce 
the  complexity. 

(iii)  Since  this  is  basically  a  low  level  visual  computation  algorithm,  the  style 
of  implementation  should  be  a  major  concern  during  theory 
formulation. This is in some sense in opposition to the well known
position of Marr [57] that theory, representation and implementation are
independent of each other. While this may be so for high level
symbolic computations, it is evident that at lower levels this epistemic
neatness is infeasible. For instance, if we agree that highly parallel
hardware is desirable and even necessary, then in order to realize
performance/cost criteria like dynamic range, speed and efficient
encoding, we might have to resort to well established devices like
“coarse coding”, quantization of the measured parameters at various
degrees of coarseness, and so on. It is hard to imagine a probabilistic
global  optimization  process  being  implemented  under  such  conditions. 
This  is  essentially  the  connectionist  argument  [6,  31]  applied  to  a 
concrete  case. 

The image motion measurement method proposed here is based on
the Hough transform in a more general form than is normally prevalent in
the vision literature [7]. The method is clustering based, and treats the
measurement process and the segmentation process uniformly. The
complexity of the general problem is very clear in
this approach, and the simplifications that allow us to obtain biologically
plausible parallel implementations using minimal spanning tree algorithms
seem justified based on simulation experiments. Because of these tighter
complexity and computational bounds, we classify this method in a class
distinct from the other discrete methods, namely the
discrete/non-iterative category of algorithms for image motion
measurement.
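
As a rough indication of how a minimal spanning tree can serve both grouping and segmentation, the sketch below clusters candidate motion vectors by building the tree with Kruskal's algorithm and refusing any edge longer than a cut length. It is a generic single-linkage illustration under assumed names, data and threshold, and is not the algorithm developed later in this chapter.

```python
import math

def mst_clusters(vectors, cut):
    """Cluster vectors by deleting minimal spanning tree edges > cut.

    Kruskal's algorithm adds edges in order of increasing length;
    stopping at the first edge longer than `cut` leaves exactly the
    components that remain after the long tree edges are removed.
    """
    n = len(vectors)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((math.dist(vectors[i], vectors[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    for length, i, j in edges:
        if length > cut:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri  # this edge belongs to the spanning tree

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Candidate displacement vectors from two differently moving regions.
vectors = [(2.0, 0.1), (2.1, 0.0), (1.9, -0.1), (-1.0, 3.0), (-1.1, 3.1)]
clusters = mst_clusters(vectors, cut=0.5)
print(clusters)
```

Vectors that fall in the same cluster are treated as belonging to one moving region, and the long edges that were refused mark the separation between regions.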

The  classification  of  the  algorithms  described  above  is  summarized  in 
Table  2.1. 

In general, the methods in categories I, II and III are specialized and
work reasonably well only in restricted domains. Examples of restrictions
imposed are: uniform illumination, smoothly varying reflectance, being able
to locate smooth zero crossing contours (of the Laplacian of the image
intensity), and local planarity of the moving surface. One shortcoming of
these approaches is that they deal primarily with movement of a single
object in the environment or motion of the observer. Some of the above
techniques are sensitive to noise. The proposed cluster based approach
seeks to remedy these lacunae.


                 Continuous                       Discrete

Iterative        I. Spatio-temporal gradient      II. Matches compatible with
                 methods                          local constraints [14, 66, 85]
                 [33, 38, 40, 44, 58, 62, 65]

Non-Iter.        III. FIR filters or local        IV. Proposed cluster based
                 polynomial approximations        approach
                 [34, 90]

Table 2.1 Classification of Image Motion Measurement Schemes

2.2. Overview of the Clustering Algorithm

The algorithm can be implemented very easily by a parallel network of
relatively simple computing elements and is motivated by the connectionist
paradigm in AI and cognitive modeling [31, 32]. The structure of the
algorithm closely models the parameter network formalism of [9]. The
measurement  of  image  motion  is  performed  in  the  lower  levels  of  a 
hierarchy  of  computing  layers  of  neuronal  computing  elements  (see  figure 
2.2). 

The layered structure reflects an organizing principle: that vision can be
viewed as computing key parameters at different levels of abstraction.
Furthermore, a natural progression of abstraction layers starts from low
levels (i.e. the image intensity function) and evolves to high levels (object
tokens). At each level an important concept is the size of the local spatial
domain in the image over which the parameters at that level can be modeled
as invariant. In analogy with the use of the concept in biology, we term this
domain the spatial receptive field (SRF) of a parameter.

Figure 2.2 The Motion Perception Hierarchy

The idea of a parameter’s SRF can be best understood by an example.
Figure 2.2 illustrates the hierarchical layered structure in the computational
model of motion analysis under discussion. The lowest (i.e. the rawest)
level of representation is depicted by the plane L1, consisting of the image
intensity function. In the first stage of processing, locations of significant
intensity change in the image are computed (layer L2); note that change
units have a very small SRF. The next layer, L3, computes the retinal
motion parameter (e.g. optical flow), indexed by the image frame positional
coordinates (x,y). The SRF of a flow parameter value is much greater than
that of the change units. The parameters computed in the following layer,
L4, can be thought of as motion vectors that are not specific to particular
spatial locations, but indicate the distribution of velocities in a “window” of
the image. Later descriptions detail how clustering in this “space” of
location independent (relatively speaking) motion vectors helps to establish
correct matches in the token matching layer L3. The next higher layer
computes the parameters of motion of the imaged surface (over the region,
which need not be known to the measurement process a priori,
corresponding to a single moving body).


The flow of data is from the lower layers of this hierarchy to the upper
layers. There are exceptions, however, since the clustering layer L4
influences the matching of velocities in the lower layer L3
(therefore the structure is not a strict hierarchy). The design of the
computation in the layers L2, L3 and L4, and how retinal motion is computed
cooperatively, are described subsequently.


The scheme of computation of the retinal motion has the following steps:

Figure 2.3 Matching Interest points from two frames

(1) Location of points in the image with significant intensity variation (contrast).
The goal is to select locations in the image where there is significant (with
respect to the neighborhood) contrast, yet there is no orientation
specificity. This means that two criteria must be satisfied:

(a)  The  contrast  variation  at  and  around  the  selected  location  should 
be  high.  This  could  be  measured  by  means  of  a  center  surround 
operator  like  the  Laplacian  or  DOG  operator  [55,  57] 

(b)  A  good  edge  operator  should  not  respond  to  intensity 
distribution  at  the  same  location,  or  have  multiple  weak 
responses. 

The  interest  operator  that  was  used  is  described  in  [12].  This  operator 
decomposes  the  local  intensity  distribution  around  a  point  in  the  image  into 
a  set  of  basis  functions.  The  selection  of  the  point  then  depends  on  the 
relative responses of the “edge” and “extremum” subspaces in the basis
set.  A  following  section  will  detail  the  design  and  operating  characteristics 
of  this  method. 

(2)  Measurement  of  image  motion.  It  is  assumed  that  the  velocity  field  is 
locally  similar,  except  for  a  small  number  of  places  where  motion 
boundaries  occur.  Each  point  selected  in  a  given  frame  (i.e.  time 
instant),  can  potentially  match  another  point  in  its  neighborhood, 
selected  from  a  later  frame.  The  process  is  depicted  in  figure  2.3  where 
the  large  circles  indicate  the  areas  searched  to  obtain  plausible  matches. 
To determine the goodness of the match, each of these “velocity
units” (or plausible match vectors) evaluates the support it receives
from nearby velocity units. This scheme was used with great success
for stereo matching by Prazdny [69]. Finally, only those matches that
can muster more support than competing matches get selected.

(3)  Clustering  in  image  motion  space.  As  the  velocity  units  are  evaluating 
their  support,  they  also  “vote”  for  non  location  specific  velocity  units 
in level L4. The units in level L4 then cluster around similar units (the
Euclidean distance metric is used) and support each other. This
helps  to  remove  the  outliers  among  them.  The  units  that  belong  to 
some  cluster  are  then  retained  and  the  rest  are  deleted.  The  surviving 
units  then  mediate  the  matching  process  in  the  lower  level  L3.  The 
basic  idea  is  illustrated  in  figure  2.4. 

The  discussion  so  far  would  seem  to  imply  the  requirement  of  two 
“snapshots”  of  a  dynamic  scene,  taken  at  consecutive  time  instants.  This  is 
not  a  critical  aspect  of  the  method,  being  only  used  for  ease  of  explanation. 
In  fact  it  adapts  very  easily  to  a  sequence  of  temporal  frames  of  a  changing 
scene.  In  this  case  all  we  need  to  do  is  to  introduce  a  temporal  decay  rate 
for  the  accumulated  support  for  the  velocity  units  in  layers  L3  and  L4.  The 
algorithm has been tested with synthetic images comprising spheres and
planes, which were painted with random dot patterns. This was mainly
done so that the computed motion vectors could be compared with the
actual values. The experiments with synthetic data show that multiple
moving bodies and as much as 20% random noise points can be handled by
the algorithm. The following sections detail the various parts of the
algorithm and the experimental results.

Figure 2.4 Clustering in image velocity space

2.3.  Computing  Interest  Points 

An  interest  point  is  a  point  in  the  image  (actually  a  small 
neighborhood)  which  has  properties  that  distinguish  it  from  its  neighboring 
points.  The  properties  in  question  may  be  simple,  like  gray  levels,  or 
sophisticated  ones  indicative  of  the  local  topography  of  the  imaged  surface. 
Previous  approaches  to  finding  interest  points  are  exemplified  in  the  work  of 
Moravec, Kitchen & Rosenfeld, Nagel, Davis, Sun & Wu, Fang & Huang
([24, 27, 51, 58, 62]).

The  difficulty  of  locating  interest  points  for  matching  stems  from  the 
fact  that  it  is  difficult  to  specify  exactly  what  should  be  the  desirable 
characteristics  of  such  feature  points.  On  the  other  hand  it  seems  clear  that 
the  following  properties  are  in  general  desirable: 

(i)  The  detection  and  localization  of  these  points  should  be  fairly 
straightforward.  In  other  words,  the  features  that  trigger  the  detection 
of  such  points  must  be  computable  by  examining  a  fairly  small  support 
region  in  an  image. 

(ii) These points should preferably be sparsely distributed in the image.
A  measure  of  such  sparseness  depends  upon  the  support  region  size  for 
the  subsequent  matching  algorithm.  Thus  one  performance  parameter 
could  be  a  measure  of  the  average  number  of  false  matches  that  have 
to  be  considered  for  every  correct  pairing  of  points  from  two  frames.  It 
has  been  reported  that  human  performance  degrades  significantly  when 
this  number  increases  beyond  four  or  five  [78]. 

(iii)  The  feature  that  characterizes  the  points  must  be  stable.  This  means 
that  small  changes  to  viewpoint  and  illumination  should  not  affect  their 
determination. 
Concrete proposals for specifying and locating feature points for matching
fall into the following classes:

1. Grayvalue corner selection: The feature is restricted to corners in the
image intensity “terrain”. The method of detection usually consists of
finding the extrema of the spatial gradient of the intensity function [51,
55, 62].

2. “Interest point” selection: The idea here is to pin down image patches
where the intensity variation profiles are distinctive in the support
region. This distinction can be mathematically specified, for instance,
by measuring the variance in the pixel intensity values and selecting
locations where it is maximised within a local support region [58].
Another method would be to choose patches with sharp autocorrelation
functions [16].

3. Selection by “topological analysis”: This method attempts to label the
intensity terrain with labels such as hill (maxima), pit (minima),
ridge/ravine (line), saddle, table edge (edge) and flats. The idea is that
these labels, being relatively viewpoint and illumination independent
compared to raw intensities and gradients, can be used to perform
sophisticated correlation type matching algorithms [38].

The  problem  with  corner  detectors  and  the  topological  analyzers  is  that  the 
distribution in the support window, in order to be able to compute the
necessary gradients. This requirement of higher order derivatives of the
intensity function (e.g. the Hessian) and the attendant computational
complexity diminishes the attractiveness of the above methods.

The  variance  based  interest  operator  is  poor  at  contour  suppression 
(especially  for  edges  oriented  at  small  angles  with  respect  to  the  horizontal 
or  the  vertical  directions).  Furthermore,  being  intensity  based,  it  is 
sensitive  to  viewpoint  and  illumination  changes.  Finally,  the  idea  of  operator 
design  based  on  maximizing  autocorrelation  is  yet  to  be  translated  into  a 
successful  design. 

The method of interest point selection suggested subsequently seeks to
combine  some  of  the  positive  aspects  of  the  above  mentioned  techniques. 
The  salient  advantages  of  this  operator  are: 

(a)  The  selection  process  is  essentially  linear  (convolution)  and  amenable 
to  parallel  implementation. 

(b)  The  features  that  are  key  to  the  detection  process  are  not  intensity 
based.  In  fact  they  have  the  flavor  of  the  labels  computed  by  the 
topological  classification  methods,  without  the  attendant  computational 
complexity. 

(c)  Contours  are  suppressed  at  all  orientations. 
(d) The method does not depend upon thresholds that have to be
determined a priori.


The principle underlying the proposal is based upon the comparison of the
outputs of isotropic feature masks. The basic scheme involves comparing
the responses of some edge operator with a center/surround operator like
the Laplacian. This scheme is depicted in figure 2.5.a, where the edge
response is provided by the Sobel operator. The normalized values of the
edge response (solid curve) and the response of the Laplacian
(dotted curve) are shown in figure 2.5.b for a step edge profile. The
responses are plotted as the operators are applied along a straight line path
perpendicular to the edge. So we see that if we were to compare the
operator responses at or near the step edge, the edge response will always be
greater. However, the situation is different when the image profile is shaped
like a corner. In the latter case (figure 2.5.c), the response of the Laplacian is
stronger than that of the Sobel operator.
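
The relative response comparison can be tried on small patches. The following is a minimal sketch, not the implementation used in this work: it assumes the standard 3x3 Sobel masks, an 8-neighbour Laplacian, and normalization of each response by the Euclidean norm of its mask. The function name `normalized_responses` and the two test patches are hypothetical. An isolated bright dot is used as the simplest point-like profile.

```python
import numpy as np

# Standard 3x3 Sobel masks and an 8-neighbour Laplacian (assumed variants).
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
SOBEL_Y = SOBEL_X.T
LAPLACIAN = np.array([[1, 1, 1], [1, -8, 1], [1, 1, 1]], float)

def normalized_responses(patch):
    """Return (edge, laplacian) responses of a 3x3 patch.

    Each response is divided by the Euclidean norm of its mask,
    which is an assumed normalization.
    """
    patch = np.asarray(patch, float)
    gx = np.sum(SOBEL_X * patch)   # horizontal gradient estimate
    gy = np.sum(SOBEL_Y * patch)   # vertical gradient estimate
    edge = np.hypot(gx, gy) / np.linalg.norm(SOBEL_X)
    lap = abs(np.sum(LAPLACIAN * patch)) / np.linalg.norm(LAPLACIAN)
    return edge, lap

step_edge = [[0, 0, 1], [0, 0, 1], [0, 0, 1]]    # vertical step edge
bright_dot = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]   # isolated point

for name, patch in [("step edge", step_edge), ("bright dot", bright_dot)]:
    edge, lap = normalized_responses(patch)
    # A point is selected when the center/surround response dominates.
    print(f"{name}: edge={edge:.2f} laplacian={lap:.2f} select={lap > edge}")
```

Under these assumptions the edge response dominates on the step edge and the Laplacian dominates on the dot, which is the behaviour the simple rule relies on.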

It  seems  that  this  simple  scheme  should  be  capable  of  selecting  interest 
points  in  an  image.  However,  there  are  some  shortcomings  of  this 
approach: 

(1) The Sobel operator is a poor detector of edges oriented away from the
vertical,  horizontal  or  the  two  diagonal  directions.  Hence  the  above 
scheme  may  select  points  along  such  edges. 

(2)  There  are  many  distinctive  variation  patterns  of  the  image  intensity 
function  that  cannot  be  detected  by  such  a  scheme  (see  figure  2.6  and 
table  2.2). 

The  above  problems  make  it  necessary  to  reexamine  the  operator  design  if 
we  want  to  make  the  relative  response  measurement  criterion  work.  To 
alleviate  the  first  problem  one  could  resort  to  directional  edge  detectors,  like 
the  Canny  operator  [20].  The  difficulty  of  such  an  approach  is  that  the 
scheme entails the replication of the directed operator at some interval
of orientation angle. Furthermore, since the above scheme is based upon
comparison of operator responses, it is easier to break up the image vector
space (this concept is explained subsequently) into orthogonal bases rather
than resort to different types of operators for different types of features and
then calibrate the relative responses on a large number of sample images.

As a solution to the first problem mentioned above, the Sobel masks
were augmented with two more 3x3 masks. This arrangement is reported to
be more isotropic in its response to edges of arbitrary orientations [35].

To  minimize  the  second  problem,  it  was  decided  to  investigate  the 
response  characteristics  of  rotation  invariant  filter  masks  [23].  These  can  be 
represented  mathematically  by 

$$M(r,\phi) = A(r)\,e^{jn\phi}, \qquad n = 0, 1, 2, 3, \ldots$$

where $j = \sqrt{-1}$, $A(r)$ is a radial weighting function and $(r,\phi)$ are polar
coordinates for position. It is generally preferable to take $A(r)$ to be a
Gaussian function. In our case, since for simplicity the masks were limited to
3x3 size, $A(r)$ was taken to be constant.

In this case the operators of various orders (corresponding to values of
$n$) can be interpreted as follows:

(a)  n  =  0  :  Averaging  operator. 

(b)  n  =  1  :  Step  edge  operator. 

(c)  n  =  2  :  Line  Detector  (  second  and  third  operators  in  figure  2.7) 
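
The interpretations listed above can be checked by sampling the rotation invariant mask on a 3x3 grid. This is a sketch under stated assumptions: $A(r)$ is taken as the constant 1, the grid has x increasing to the right and y increasing upward, and the centre weight is set to zero for $n \ge 1$ because the angle is undefined at $r = 0$. The function name `circular_harmonic_mask` is hypothetical.

```python
import numpy as np

def circular_harmonic_mask(n):
    """Sample M(r, phi) = exp(j*n*phi) on a 3x3 grid, with A(r) = 1."""
    # xs varies along columns, ys along rows (top row has y = +1).
    xs, ys = np.meshgrid([-1, 0, 1], [1, 0, -1])
    phi = np.arctan2(ys, xs)
    mask = np.exp(1j * n * phi)
    if n != 0:
        mask[1, 1] = 0  # assumption: no angular response at the centre
    return mask

# n = 0: all weights equal, i.e. an averaging operator.
print(circular_harmonic_mask(0).real)

# n = 1: real and imaginary parts respond to vertical and horizontal step edges.
print(np.round(circular_harmonic_mask(1).real, 3))
print(np.round(circular_harmonic_mask(1).imag, 3))

# n = 2: real and imaginary parts respond to axis-aligned and diagonal lines.
print(np.round(circular_harmonic_mask(2).real, 3))
print(np.round(circular_harmonic_mask(2).imag, 3))
```

Each complex mask of order $n \ge 1$ contributes two real 3x3 masks (its real and imaginary parts), which is how a pair of edge masks and a pair of line masks arise from $n = 1$ and $n = 2$.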


For  reasons  of  economy,  the  operator  size  was  confined  to  3x3.  However, 
these  operators  were  applied  at  various  image  resolutions  by  smoothing  and 
very important property of the set of masks described so far is that with the
addition of a saddle mask, we have a set of nine 3x3 masks which form a
complete orthogonal basis for the space spanned by the nine dimensional
vectors defined over the real field (figure 2.7).

In  addition,  the  space  spanned  by  the  feature  basis  set  can  be  thought 
to  be  subdivided  into  three  components  which  we  term  the  edge,  extremum 
and  average  subspaces.  The  addition  of  the  line  and  saddle  mask  (the  fourth 
operator  in  the  extremum  set  in  figure  2.7)  proves  to  be  more  powerful  for 
detecting  intensity  profiles  that  show  a  high  degree  of  curvature,  but  are  not 
like step edges. Thus, for example, acute corners have the characteristics of
line  terminations  and  hence  trigger  the  extremum  detectors  due  to  the 
presence  of  the  line  operators  in  the  latter  space. 

2.3.1.  Feature  Classification  by  Orthogonal  Decomposition 

The image is a three dimensional function. However, we
concern ourselves with a time slice of this function at time $t = t_1$, thus
obtaining a two dimensional function

$$I(x,y) = I(x,y,t_1)$$

An image vector at a location $(x,y)$ is formed by concatenating the rows of
the following 3x3 image patch:

$$\begin{array}{ccc}
I(x-1,y+1) & I(x,y+1) & I(x+1,y+1) \\
I(x-1,y) & I(x,y) & I(x+1,y) \\
I(x-1,y-1) & I(x,y-1) & I(x+1,y-1)
\end{array}$$

The image vector $\mathbf{i}$ belongs to a 9 dimensional vector space defined
over the real field,

$$\mathbf{i} = (I_1, I_2, \ldots, I_9)^T$$

where $I_1$ is $I(x-1,y-1)$, $I_2$ is $I(x,y-1)$ and so on; alternatively,

$$\mathbf{i} = \sum_{k=1}^{9} I_k \mathbf{e}_k$$

where $\mathbf{e}_k$ is the $k$th column of the 9x9 identity matrix.

The image vector, as defined previously, is represented with respect to the
basis $\{\mathbf{e}_k\}$. The components, therefore, by themselves do not convey any
information regarding the local topography of the image.

When  we  define  the  image  in  this  manner,  the  operation  of  convolving 
the  image  with  a  given  point  spread  function  or  correlating  with  a  particular 
feature  template  can  be  expressed  with  respect  to  the  vector  inner  product. 
Thus convolution becomes

$$I * h = \mathbf{h} \cdot \mathbf{i}$$

where $\mathbf{h}$ is the vector representation of the point spread function and $\mathbf{i}$ is the
image vector at a point.

With the above interpretation in mind we freely interchange the terms
function and vector in subsequent text. More importantly, thinking of point
spread functions as vectors allows us to transform the image vector into a
different finite basis space corresponding to the prototypical features that we
are interested in. This transformation is wrought by a nonsingular matrix $T$
whose columns are the feature basis vectors $\{\mathbf{f}_j\}$. Thus the image vector $\mathbf{i}$ is
transformed into the vector $\mathbf{v}$ where

$$\mathbf{i} = T\mathbf{v} \qquad (2.1)$$

The purpose of this transformation, in our case, is to obtain an image code
whose components correspond to the degree of match between the image
function and the feature functions. In general, to compute the transformed
vector $\mathbf{v}$ from $\mathbf{i}$ requires the solution of simultaneous linear equations.
However, computation of the $k$th component of $\mathbf{v}$ becomes simple when $\mathbf{f}_k$
is orthogonal to the other basis vectors in the set $\{\mathbf{f}_j\}$. In this case, we have
from equation (2.1)

$$\mathbf{i} \cdot \mathbf{f}_k = \sum_j v_j \, \mathbf{f}_j \cdot \mathbf{f}_k = v_k \|\mathbf{f}_k\|^2$$

In particular, if the basis vectors are chosen so that they form an
orthonormal set then

$$\mathbf{i} \cdot \mathbf{f}_k = v_k$$

Since the image vector is finite dimensional we can design an orthogonal
basis set for the space of the image vector. In addition, this basis set is
constructed in such a way that each basis vector corresponds to a feature
primitive. Decomposing the image vector in terms of the new basis would
give us a new set of components (or weights) indicating the strength of each
of the features represented by the respective basis vector. This idea is
originally due to Frei and Chen [35]. Their purpose was to develop an edge
detector  which  would  not  require  thresholding  after  the  convolution  step. 
They  convolved  a  3x3  image  region  with  nine  orthogonal  masks  and 
compared  the  outputs  of  the  edge  masks  with  a  set  of  “line”  detection 
masks.  The  present  method  for  interest  point  selection  follows,  in  some 
sense,  a  strategy  that  is  a  dual  of  Frei  and  Chen’s  approach  with  a  different 
set  of  basis  functions. 
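
The decomposition can be illustrated with a short sketch. Since the basis masks of the present method are given only in a figure, the code below substitutes the nine orthonormal masks of Frei and Chen as a stand-in; this substitution, the grouping of the masks into "edge", "point" and "average" subspaces, and the names `decompose` and `subspace_energy` are all assumptions made for illustration. The patch is flattened top row first, which differs from the row order used in the text but does not affect the result as long as masks and patches are flattened the same way.

```python
import numpy as np

s = np.sqrt(2.0)
# Frei-Chen masks, used here as a stand-in orthonormal feature basis.
masks = [
    np.array([[1, s, 1], [0, 0, 0], [-1, -s, -1]]) / (2 * s),   # edge
    np.array([[1, 0, -1], [s, 0, -s], [1, 0, -1]]) / (2 * s),   # edge
    np.array([[0, -1, s], [1, 0, -1], [-s, 1, 0]]) / (2 * s),   # edge
    np.array([[s, -1, 0], [-1, 0, 1], [0, 1, -s]]) / (2 * s),   # edge
    np.array([[0, 1, 0], [-1, 0, -1], [0, 1, 0]]) / 2,          # line
    np.array([[-1, 0, 1], [0, 0, 0], [1, 0, -1]]) / 2,          # line
    np.array([[1, -2, 1], [-2, 4, -2], [1, -2, 1]]) / 6,        # Laplacian-like
    np.array([[-2, 1, -2], [1, 4, 1], [-2, 1, -2]]) / 6,        # Laplacian-like
    np.ones((3, 3)) / 3,                                        # average
]

# Columns of T are the feature basis vectors f_k.
T = np.stack([m.ravel() for m in masks], axis=1)
assert np.allclose(T.T @ T, np.eye(9))  # the basis is orthonormal

def decompose(patch):
    """Return v with v_k = i . f_k, valid because the basis is orthonormal."""
    i = np.asarray(patch, float).ravel()
    return T.T @ i

def subspace_energy(patch):
    """Squared length of the projection onto each (assumed) subspace."""
    v = decompose(patch)
    return {
        "edge": float(np.sum(v[:4] ** 2)),
        "point": float(np.sum(v[4:8] ** 2)),
        "average": float(v[8] ** 2),
    }

step_edge = [[0, 0, 1], [0, 0, 1], [0, 0, 1]]
bright_dot = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
for name, patch in [("step edge", step_edge), ("bright dot", bright_dot)]:
    e = subspace_energy(patch)
    print(name, e, "select:", e["point"] > e["edge"])
```

Because the basis is orthonormal, no linear system has to be solved: each component is a single inner product, and the three subspace energies sum to the squared length of the image vector.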

As  stated  before,  the  set  of  basis  functions  in  our  model  is  built  around 
feature  primitives  like  edge,  maxima/minima  and  saddle  type  variation. 
Since  the  image  vector  is  nine  dimensional  (i.e.  the  operator  size  is  3x3) 
there  are  nine  elements  in  the  feature  basis  space.  The  feature  space  is 
divided  into  three  subspaces: 

1.  The  Extremum  subspace  defined  by  the  laplacian,  line  and  saddle 
masks. 


2.  The  edge  subspace. 


3.  The  average  subspace. 

The basis functions used to define these subspaces are shown in figure 2.6.
A key characteristic of the feature subspaces is the directional isotropy of
their response patterns.

Figure 2.6 Interest Operator Masks

To test the applicability of the above image decomposition scheme, a set of
image profiles was devised to verify the ability of the interest operator to
classify some idealized markings. The test images were 3x3 masks shown in
figure 2.7. The results of the test are summarized in Table 2.2. The figures
in the columns indicate the normalized responses of the operators to the
respective image masks. The last two columns indicate the result of applying
two different decision rules for interest point selection. It is seen that the
simple rule which compares the outputs of the Laplacian and the Sobel
operator is inadequate, although it is enough to discriminate between
interesting and “edgy” regions in many instances.
No.   Av.(M)   Lap(L)   Sobel(E)   Point(PS)   Edge(ES)   PS > ES   L > E
 1     0.34     0.95      0.00       0.95        0.00       yes      yes
 2     0.34     0.12      0.58       0.98        1.16       no       no
 3     3.01     0.00      2.31       0.00        3.47       no       no
 4     3.01     0.00      2.45       0.00        3.47       no       no
 5     1.01     0.36      1.16       1.21        1.74       no       no
 6     3.01     0.00      0.00       2.01        0.00       yes      no
 7     3.01     0.00      0.01       2.01        0.01       yes      no
 8     6.34     0.95      0.01       3.78        0.01       yes      yes
 9     4.34     1.65      0.01       5.36        0.01       yes      yes
10     6.67     0.83      2.05       3.68        3.47       yes      no
11     0.67     0.83      0.58       1.68        1.16       yes      yes
12     5.67     3.30      5.20       6.30        5.20       yes      no

Table 2.2 Operator responses to the test masks


2.4.  Algorithms  for  retinal  motion  measurement 

The  correspondence  problem  is  almost  universally  regarded  as  difficult. 
As  mentioned  earlier,  it  arises  in  the  measurement  of  temporal  image 
disparities.  The  problem  is  magnified  for  motion  measurement,  since  the 
disparity in this case is not constrained, as in the case of stereo, to lie on a
known line (epipolar) in image space. The overall scheme of things is simple:
select interest points in image frames and then decide which point from one
frame matches another point from the other frame. If it is possible to
obtain interest points that are sparse then correspondence is not difficult.
Here sparseness means that the average disparity value is smaller than the
average spatial distance between points in the same image frame. An
interesting quantification for the degree of sparseness is due to Stevens [78]
and is the number of false matches possible, on average, for a given match
neighborhood size.

2.4.1. The Matching Algorithm using local support

The algorithm proceeds from the interest point stage by forming all
possible matches subject to a maximum limit on the magnitude of the
match vector. This is equivalent to saying that the match neighborhood size
is determined a priori. Each match vector then proceeds to accumulate
evidence supporting its existence within a support neighborhood, which is
larger than the match neighborhood. This scheme is based upon the
assumption that the imaged surface depth varies smoothly (a similar scheme
is reported to be successful with the stereopsis problem [69]).

Figure 2.7 Image masks to test Interest Operator

To justify the notion of local support, consider optical flow $(u,v)$
generated by a translating object. In this special case the constraint equations
are

$$u = \frac{U - xW}{Z}, \qquad v = \frac{V - yW}{Z}$$

where $(U,V,W)$ is the translational velocity in three space and $Z$ is the depth of
the object corresponding to the retinal location $(x,y)$.
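
To make the constraint equations concrete, the sketch below evaluates them at two neighbouring retinal locations and compares the exact change in $u$ with the first order estimate obtained by differentiating $u = (U - xW)/Z$ with respect to $x$ and $Z$. The numerical values and the function names are hypothetical, chosen only for illustration.

```python
def translational_flow(x, y, Z, U, V, W):
    """Optical flow (u, v) at retinal location (x, y) with depth Z,
    for a body translating with velocity (U, V, W)."""
    return (U - x * W) / Z, (V - y * W) / Z

def first_order_delta_u(x0, Z0, dx, dZ, U, W):
    """First order change in u for a step dx in x and dZ in depth."""
    u0 = (U - x0 * W) / Z0
    # du/dx = -W/Z0 and du/dZ = -u0/Z0, both evaluated at the point p.
    return -(W / Z0) * dx - (u0 / Z0) * dZ

U, V, W = 1.0, 0.5, 0.2        # hypothetical translational velocity
x0, y0, Z0 = 0.1, 0.2, 10.0    # point p and its depth
dx, dZ = 0.01, 0.05            # neighbour r = (x0 + dx, y0) at depth Z0 + dZ

u0, _ = translational_flow(x0, y0, Z0, U, V, W)
u1, _ = translational_flow(x0 + dx, y0, Z0 + dZ, U, V, W)
exact = u1 - u0
approx = first_order_delta_u(x0, Z0, dx, dZ, U, W)
print(exact, approx)
```

When depth varies smoothly, both the depth step and the position step are small, so neighbouring flow vectors are close; this is the property that the support computation exploits.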

If the depth function $Z(x,y)$ is smooth then, to a first order approximation,
the spatial rate of change in the optical flow is proportional to the spatial
rate of change in depth. For instance, consider the optical flow value at
$p = (x_0,y_0)$. Let it be $(u_0,v_0)$; also let the depth at $p$ be $Z_0$. Then the
difference between the optical flow at $p$ and at a neighboring point $r = (x,y)$ is:

$$\delta u = u(x,y) - u_0 \approx -\frac{W}{Z_0}\,\delta x - \frac{U - x_0 W}{Z_0^2}\,\delta Z$$

where $\delta x = x - x_0$, $\delta Z = Z(x,y) - Z_0$, and higher order terms in $\delta x$ and $\delta Z$ are
neglected. This leads to:

$$|\delta u| \le \lambda_1 |\delta Z| + \lambda_2 |\delta x|$$
$$|\delta v| \le \lambda_3 |\delta Z| + \lambda_2 |\delta y|$$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are constants for a local image neighborhood. Combining
the above we have:

$$d_{st}(u,v) \le \lambda\,|\delta Z| + \lambda_2\, d_{st}(x,y)$$

where $\lambda = \lambda_1 + \lambda_3$, $d_{st}(u,v) = |\delta u| + |\delta v|$, and $d_{st}(x,y) = |\delta x| + |\delta y|$. Thus the
situation that arises here is that the smoothness in the depth function
relates to the smoothness of the displacement field. The support that two
candidate vectors $(u_1,v_1)$ and $(u_2,v_2)$ at retinal locations $(x_1,y_1)$ and $(x_2,y_2)$
respectively provide each other is given by a support function.
Our experimental results indicate that a linear support function is adequate.
It  should  be  noted  that  in  Prazdny’s  algorithm  for  stereopsis  [69],  an 
exponential  support  function  is  used.  The  support  function  is  a  quantitative 
expression  for  the  notion  of  local  smoothness.  Prazdny’s  choice  of  support 
function  is  intuitive,  based  on  psychophysical  data.  The  same  justification 
applies  to  motion  correspondence.  As  an  example  consider  the  following 
exponential  support  function,  which  we  used  in  our  experiments: 

s(d,i)=^x P(-  (i-n 

where $l = d_{st}(x,y)$ and $d = d_{st}(u,v)$. As mentioned previously, there seems to
be no great advantage in using an exponential support function in
preference to a linear one. The advantage of clustering is that, once the
clusters have been determined, the parameters of the support function are
obtained. Thus once the maximum velocity difference (i.e. the diameter of
the cluster) is known, the largest velocity gradient that should be allowed
can be calculated. This is the ratio of the cluster diameter (i.e. largest linear
distance across the cluster) and the diameter of the support window in

image  space.  Suppose  this  ratio  is  *  and  =  -^-(«-y),  then  the  linear 

support  function  is 

$$s(d,l) = \begin{cases} f(d,l) & \text{when } f(d,l) > 0 \\ 0 & \text{otherwise} \end{cases}$$

Algorithm 2.1 outlines the steps involved in the computation of the
matches.


Algorithm 2.1: Finding motion correspondence by support disparity without
clustering.

begin

F1 := { X | X is a point with coordinate (x,y) on the first frame };

F2 := { X | X is a point with coordinate (x,y) on the second frame };

(* Computation of total support for each disparity *)

for each element p of F1 do
    for each element q of F2 within a radius R of p do
        Totalsupport(pq) := 0;
        for each element r of F1 do
            for each element s of F2 within a radius R of r do
                Support := support provided by vector rs to vector pq;
                Totalsupport(pq) := Totalsupport(pq) + Support;
            end_for;
        end_for;
    end_for;
end_for;

(* Finding correspondences from the total supports *)

for each element p of F1 do
    for each element q of F2 within a radius R of p do
        Find Maximum(Totalsupport(pq))
        (* the vector pq corresponding to Maximum(Totalsupport(pq)) gives
           the correspondence *)
    end_for;
end_for;

end. (* algorithm 2.1 *)
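
Algorithm 2.1 can be sketched in a few lines of Python. The linear support function used here, $f(d,l) = 1 - d/(\kappa l)$ clipped at zero, is only one plausible linear form and is an assumption, as are the parameter name `kappa` (the largest allowed velocity gradient), the use of city block distances for $d$ and $l$, and the rule that candidates sharing the same first-frame point lend each other no support.

```python
import math

def support(pq, rs, kappa=1.0):
    """Support that candidate match rs lends to candidate match pq."""
    (p, q), (r, s) = pq, rs
    # d: city block difference between the two displacement vectors.
    d = abs((q[0] - p[0]) - (s[0] - r[0])) + abs((q[1] - p[1]) - (s[1] - r[1]))
    # l: city block distance between the two retinal locations.
    l = abs(p[0] - r[0]) + abs(p[1] - r[1])
    if l == 0:
        return 0.0  # assumption: no support between candidates at one point
    return max(0.0, 1.0 - d / (kappa * l))

def match(F1, F2, R, kappa=1.0):
    """Return {p: q}, the best supported match for each point p of F1."""
    # All candidate match vectors within the match neighbourhood of radius R.
    cands = [(p, q) for p in F1 for q in F2 if math.dist(p, q) <= R]
    # Total support accumulated by each candidate from all candidates.
    total = {pq: sum(support(pq, rs, kappa) for rs in cands) for pq in cands}
    best = {}
    for (p, q), t in total.items():
        if p not in best or t > total[(p, best[p])]:
            best[p] = q
    return best

# Five points translating by (1, 0), plus one spurious point in frame two.
F1 = [(0, 0), (4, 0), (0, 4), (4, 4), (8, 2)]
F2 = [(x + 1, y) for x, y in F1] + [(0, 1)]
print(match(F1, F2, R=1.5))
```

The sketch evaluates every pair of candidates, so its cost grows with the square of the number of candidate matches; restricting `rs` to a support neighbourhood around `p` would reduce this.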

2.4.2.  Retinal  motion  detection  with  velocity  clustering 

The  simple  algorithm  presented  above  works  well  in  most  instances. 
However,  for  cases  where  there  are  a  large  number  of  match  possibilities  for 
every  point,  the  method  is  cumbersome.  In  such  instances,  a  separate  layer 
of space unspecific displacement (or motion) units is computed. This is
like a cluster space of retinal motion parameters (i.e. $u,v$) with an SRF
that extends over the window of the image that is of interest. Each unit in
this  cluster  space  collects  “votes”  or  support  from  the  location  specific 
displacement  (or  match)  vectors  of  identical  magnitude  and  orientation, 
from  the  layer  below. 
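
The voting step can be pictured as a histogram over displacement vectors. The sketch below is a simplified illustration: it counts one unweighted vote per candidate match, whereas votes could equally be weighted by accumulated support. The function name `vote` and the sample data are hypothetical.

```python
from collections import Counter

def vote(candidates):
    """Count, for each displacement (u, v), the location specific
    candidate matches that share exactly that displacement."""
    votes = Counter()
    for (px, py), (qx, qy) in candidates:
        votes[(qx - px, qy - py)] += 1
    return votes

# Candidate matches ((x, y) in frame one, (x, y) in frame two).
candidates = [
    ((0, 0), (1, 0)),
    ((4, 0), (5, 0)),
    ((0, 4), (1, 4)),
    ((0, 0), (0, 1)),   # a competing, spurious candidate
]
print(vote(candidates).most_common())
```

Displacements shared by many locations collect many votes, while fortuitous candidates remain isolated in the velocity space.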

The  clustering  approach  to  visual  motion  measurement  and 
segmentation  has  a  number  of  attractive  features  that  will  be  mentioned 
shortly.  The  clustering  process  is  best  understood  in  terms  of  partitioning  of 
graphs. Let $G = (V,E,W)$ be a weighted undirected graph with vertex set $V$,
edge set $E$ and a distance or weight function $W: E \to R^+$ (the set of
nonnegative reals). A partition of the set of vertices into $k$ sets $(C_1, C_2, \ldots, C_k)$
is called a k-split. The sets $C_i$ are called clusters. In addition there is an
objective function defined on the k-split.

The clustering problem can now be defined in two ways. In case it is known
a priori how many clusters (i.e. $k$) are present then:

Definition I: Given a graph $G$, an objective function $f$ and an integer $k$, find
the k-split which minimizes the objective function. In other words, find
that k-split $S^*$ for which

$$f(S^*) = \min \{ f(S) \mid S \text{ is a k-split for } G \}$$

Under some circumstances, however, the number of clusters is not known
a priori. In this case one can specify a threshold $\theta$, whence the clustering
process is defined as:

Definition II: Given a graph $G$, an objective function $f$ and a positive real
valued threshold $\theta$, find, for the least value of $k$, a k-split with objective
function value $< \theta$.

Clustering  can  also  be  defined  with  respect  to  an  n-dimensional  feature 
space  in  an  exactly  analogous  manner.  Of  course  in  this  case  one  must 
formulate  a  distance  metric  for  points  in  the  feature  space.  An  alternative  to 
distance  or  dissimilarity  function  is  a  similarity  function,  examples  of  which 
were  cited  in  the  previous  section.  In  the  case  that  similarities  are  used  in 
place  of  distances,  the  objective  function  is  usually  maximised.  Note  that 
the  definition  does  not  require  the  distance/similarity  to  be  metric. 
The objective function $f$ plays a crucial role in clustering. As mentioned
before, the motion vectors and their spatial positions form elements that
belong to a four dimensional space. It is expensive to compute clusters in
this four dimensional space. Conceptually, this is an attractive framework in
which to view the motion measurement and segmentation problem. The
algorithm in the previous section is similar to the stereo algorithm proposed
in [69]. On the other hand, the cluster based formulation unifies the notions of
matching and segmentation. The assumption is that, with adequate data from
all the different surfaces moving in the visual field, separate significantly
large clusters (compared to random fortuitous clusters) will indicate the
corresponding motion segments. Then, the desirable matches belong to one
of these larger clusters while mismatches are scattered into small noise
clusters. In order that the clusters be well rounded and not in the form of
“stringy chains”, the objective function must be chosen with care.

A  good  clustering  strategy  is  provided  by  the  so  called  complete  linkage 
or  furthest  neighbor  technique.  The  objective  function  in  this  case  is  defined 
to  be  the  largest  distance  between  pairs  of  elements  computed  over  all  pairs 
that belong to the same cluster. Unfortunately, clustering with this function
is known to be NP-hard for feature spaces of dimension two or more, when
there is more than one cluster [75].
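
The complete linkage objective is easy to evaluate for a given split even though minimizing it is hard. The following sketch computes it for a k-split given as a list of clusters of points; the function names and the Euclidean metric are assumptions for illustration.

```python
import math
from itertools import combinations

def cluster_diameter(cluster):
    """Largest pairwise distance inside one cluster (0 for a singleton)."""
    return max((math.dist(a, b) for a, b in combinations(cluster, 2)),
               default=0.0)

def complete_linkage_objective(split):
    """Largest distance between any two elements sharing a cluster."""
    return max(cluster_diameter(c) for c in split)

# Two alternative 2-splits of the same three points.
good = [[(0, 0), (3, 4)], [(10, 10)]]
poor = [[(0, 0)], [(3, 4), (10, 10)]]
print(complete_linkage_objective(good), complete_linkage_objective(poor))
```

A split with a smaller objective value has tighter, more compact clusters, which is why this objective discourages chain-like clusters.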

To  overcome  the  problem  of  large  dimensionality  and  computational 
complexity, the method adopted was to project the four dimensional feature
space into a two dimensional space of motion vectors without spatial
indexing. Furthermore, simulation with various synthetic data showed that
the problem of chain formation showed up very rarely in the projected
space. For this reason an agglomerative hierarchical clustering strategy was
adopted.  Initially  all  the  elements  belong  to  singleton  clusters.  At  each  stage 
of  processing  (for  simplicity  assume  sequential  execution  of  the  stages), 
each  cluster  merges  with  its  nearest  neighbor.  The  process  continues  until 
there  is  only  one  cluster.  This  is  called  the  single  linkage  method  and 
essentially  generates  a  minimal  spanning  tree  of  the  feature  space  graph,  in 
which  the  nodes  are  the  elements  to  be  clustered  and  the  edges  are  the 
distances  between  them. 
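
The single linkage procedure can be sketched as follows. This is a minimal illustration and not the implemented program: it assumes the motion vectors are (dx, dy) pairs, visits edges in the order used by Kruskal's minimal spanning tree algorithm, and stops merging once the next shortest edge exceeds a hypothetical distance `threshold`. The names `single_linkage` and `threshold` and the sample data are chosen for illustration only.

```python
import math
from itertools import combinations


def single_linkage(vectors, threshold):
    """Group 2-D motion vectors by single linkage.

    Edges of the complete graph are visited in order of increasing
    Euclidean length (the order used by Kruskal's minimal spanning tree
    algorithm). Two clusters are merged whenever the edge joining them is
    no longer than `threshold`.
    """
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent links to the cluster representative, halving paths.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(vectors[i], vectors[j]), i, j)
        for i, j in combinations(range(len(vectors)), 2)
    )
    for length, i, j in edges:
        if length > threshold:
            break  # every remaining edge is longer, so merging stops here
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    clusters = {}
    for index, vector in enumerate(vectors):
        clusters.setdefault(find(index), []).append(vector)
    # Largest clusters first; small ones are the candidates for noise.
    return sorted(clusters.values(), key=len, reverse=True)


motion = [(5, 0), (5, 1), (6, 0), (-3, 2), (-3, 3), (9, -7)]
for cluster in single_linkage(motion, threshold=1.5):
    print(len(cluster), cluster)
```

Cutting the tree at a threshold, rather than merging down to a single cluster, corresponds to reading the cluster tree at one level.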

The computational complexity of the algorithm is $O(n^2 \log n)$ for the
serial case with n elements. The algorithm can be easily implemented in
parallel with p processors with a complexity of $O(\frac{n^2}{p} \log n)$ [76].
Another advantage of adopting the clustering view is that a number of suboptimal
algorithms have been published and could be used for this application (e.g.
see [37] for a factor 2 $O(nk)$ algorithm).

The  implemented  program  follows  an  algorithm  given  in  [25].  The 
clustering  metric  is  the  Euclidean  distance  between  two  motion  vectors.  The 
number of clusters depends upon a threshold for the similarity. This
threshold is chosen depending on its stability, meaning that small changes to
it should not affect the clustering in any significant way. The cluster trees
(or dendrograms) can be seen for the synthetically generated data of two
differently moving surfaces in figure 2.8.

The  clusters  so  formed  now  compete  against  each  other  and  only  the 
larger  clusters,  i.e.  the  ones  with  accumulated  votes  in  the  same  order  of 
magnitude,  are  kept.  These  clusters  then  mediate  the  matching  process  in 
the  lower  level  of  displacement  vectors  (figure  2.4).  By  this  process  two 
things  are  achieved: 

1.  Noise  points  and  spurious  matches  are  avoided. 

2.  In  the  case  of  multiple  body  motion,  the  clusters  provide  a  convenient 
label  for  segmenting  the  displacement  field. 


An  outline  of  the  algorithm  follows: 


Algorithm 2.2: Finding motion correspondence by clustering followed by
application of support disparity.


FR1 := { X | X is a point with coordinate (x,y) on the first frame }
FR2 := { X | X is a point with coordinate (x,y) on the second frame }


(* setting up the table for clustering *)

for each element p of FR1 do
    for each element q of FR2 within a radius R of p do
        displacement_x_direction := (x coordinate of q) - (x coordinate of p);
        displacement_y_direction := (y coordinate of q) - (y coordinate of p);
        clustertable[displacement_x_direction, displacement_y_direction] :=
            clustertable[displacement_x_direction, displacement_y_direction] + 1;
    end_for;
end_for;


Find the clusters in the two-dimensional array clustertable;

Remove clusters with weak overall support (votes);

From the clusters find the feasible disparities;

Consider points in the feasible disparity ranges only, and apply Algorithm 1;
end. (* algorithm 2 *)
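
The table-building step of the algorithm above can be sketched in Python. This is only an illustrative sketch under stated assumptions: points are integer (x, y) tuples, the table is a `Counter` keyed by displacement rather than a two-dimensional array, and the function name `build_cluster_table` and the sample points are hypothetical. The later steps (finding clusters, pruning weak ones, applying the support-based matcher) are not shown.

```python
from collections import Counter


def build_cluster_table(frame1, frame2, radius):
    """Count votes for every displacement between nearby point pairs.

    For each point p in the first frame and each point q in the second
    frame lying within `radius` of p, the displacement q - p gets one vote.
    """
    table = Counter()
    for px, py in frame1:
        for qx, qy in frame2:
            dx, dy = qx - px, qy - py
            if dx * dx + dy * dy <= radius * radius:
                table[(dx, dy)] += 1
    return table


# Three points translated by (2, 1), plus one noise point in the second frame.
frame1 = [(0, 0), (10, 0), (0, 10)]
frame2 = [(2, 1), (12, 1), (2, 11), (7, 7)]
table = build_cluster_table(frame1, frame2, radius=8)
for displacement, votes in table.most_common():
    print(displacement, votes)
```

In this toy input the true displacement collects three votes while the pairings with the noise point collect one vote each, which is the situation the vote threshold is meant to exploit.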


The  matching  algorithm  is  formulated  according  to  whether  the  points  are 
labeled  or  not.  In  case  of  unlabeled  points  (as  in  the  above  algorithms)  : 

All  neighboring  points  support  (vote  for)  a  particular  disparity  value. 
Similar  values  support  each  other  strongly  in  a  local  region.  Shorter 
length  disparities  are  preferred.  A  point  adopts  a  match  for  which  it 
finds  the  maximum  support. 


The  strategy  is  similar  in  spirit  to  the  more  sophisticated  matchers,  for 
instance,  those  using  labeled  points  (e.g.  [66]).  The  feature  points  can  carry 
labels  which  are  computed  from  the  outputs  of  the  nine  basis  operators.  A 
label  is  a  code  that  identifies  the  image  point  in  question.  Now  the  matcher 
weights  the  “supporting”  votes  according  to  the  similarity  of  these  codes. 
However,  we  avoid  iterative  refinement,  which  is  usually  employed  in 
similar algorithms [14].

2.5. Experiments

Synthetic  Images:  Experiments  have  been  conducted  on  synthetic  images 
of  spheres  and  planes  “painted”  with  random  dot  patterns.  All  the  objects 
are  opaque  and  sometimes  intersect  each  other.  The  choice  of  spheres  and 
planes  is  motivated  by  the  necessity  of  local  smoothness  in  the  imaged 
depth.  Yet,  at  the  same  time,  since  there  are  multiple  differently  moving 
objects  as  well  as  occlusion  of  one  body  by  another,  motion  boundaries  do 
occur. 

The  image  formation  technique  was  the  perspective  projection.  In  a 
single  image  there  could  be  one  or  more  instances  of  the  above  primitive 
objects  moving  with  similar  or  different  velocities  (translational  and 
rotational) in 3-space. An illustrative depth map of the surface of a sphere
embedded in a plane is shown in Plate 2.I.

For  single  body  motion  the  matches  were  found  with  close  to  100% 
accuracy.  Addition  of  uniformly  distributed  uncorrelated  noise  points  to  a 
level  of  10%  did  not  cause  any  significant  difference  in  the  level  of  correct 
matches  found.  However,  the  noise  points  generated  some  spurious  matches 
among  themselves.  The  clustering  approach  works  better  in  this  situation 
with  considerable  removal  of  noise  points  and  false  matches. 

As a conservative estimate the average number of plausible match
vectors considered was of the order of 10 - 15. Of course in regions with
dense random dot patterns this number was higher. Even with larger
numbers the selection of the correct match was possible with the support
disparity approach. These figures for plausible match vectors fell to a third
of their number with the clustering approach. There was also a speedup of
execution by a factor of around ten.

Figure 2.8 Cluster Tree for two body Motion

With  two  body  motion  about  94%  of  the  correct  matches  could  be 
obtained.  The  hardest  matches  to  find  lay  on  the  border  of  the  two  bodies, 
as was to be expected. The dendrogram (cluster tree) is shown in figure 2.8.
Also, from Plate 2.IV it can be observed that the matches have been found
correctly almost everywhere (comparing with Plate 2.II). The exception
occurs at the boundary of the two bodies, where incorrect matches were
found. An intensity coded view of the cluster generated for this case is
shown in Plate 2.III. The pixel positions and intensity represent the location
(in velocity space) and population of the relevant cluster (or bin). In the
case of totally transparent bodies the algorithm’s performance is drastically
reduced with 50-60% wrong matches being found.

Images  of  Natural  Scenes:  Quantitative  justification  of  performance  is 
difficult on natural scenes. Through manual inspection it has been found
that the number of wrong correspondences obtained is insignificant. Plate
2.V  demonstrates  the  result  of  applying  the  algorithm  on  a  natural  scene. 
The  top  right  box  depicts  a  single  snapshot  (frame)  of  the  scene,  while  the 
bottom  right  box  shows  the  interest  points  computed,  superposed  on  the 
scene.  The  box  at  bottom  left  illustrates  the  computed  image  motion 
vectors,  which  can  be  compared  with  the  manually  computed  vectors.  The 
latter  were  obtained  by  selecting  the  points  to  match  from  two  frames  by 
inspection  and  then  matching  them  as  they  were  selected  manually.  In  Plate 
2.VI,  points  from  two  successive  time  frames  are  shown  superposed,  to 
illustrate  the  input  available  to  the  clustering  algorithm.  Here  the  coding  of 
the  points  from  different  frames  is  done  by  bright  and  dim  dots.  Finally  the 
cluster obtained for this image sequence is shown in Plate 2.VII. However,
correct matches have been found for only roughly 40% of the points.
This discrepancy is a result of the uncertainty associated with the
determination of interest points.

It is estimated that even with some amount of input/output processing,
the part of the algorithm which could be parallelized took about 75% of the
time on a serial processor. With the removal of file manipulations this
could rise to 95% or more of the time. It is feasible to implement the
algorithm  on  a  128-processor  MIMD  machine  (BBN  Butterfly)  with 
considerable  improvement  in  running  time. 
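
The quoted parallel fractions bound the speedup that such an implementation could reach. As a rough estimate that is not part of the original analysis, Amdahl's law gives the best possible speedup on p processors when a fraction f of the serial running time can be parallelized, ignoring communication and synchronization costs. The function name `amdahl_speedup` is illustrative.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Upper bound on speedup from Amdahl's law: 1 / ((1 - f) + f / p)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


for fraction in (0.75, 0.95):
    bound = amdahl_speedup(fraction, 128)
    print(f"f = {fraction:.2f}: speedup on 128 processors <= {bound:.1f}")
```

Under these assumptions the bound is roughly 3.9 for a 75% parallel fraction and roughly 17.4 for 95%, which is why removing the serial file manipulations matters more than adding processors.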

2.8.  Conclusions 

The goal of the research here was to formulate a computational
framework for the measurement of retinal motion. It was desirable that the
motion measurement algorithm be implementable in parallel, and conform
to a connectionist implementation strategy. An important consideration was
graceful degradation in the presence of increasing amounts of noise, and the
ability to handle multiple moving objects. An important issue relating to the
task of retinal motion measurement is the choice of the matching primitive or
token and the process for obtaining these primitives in an image. The
overall  framework  of  the  algorithm  is  based  on  the  matching  paradigm  of 
motion  perception.  This  is  based  upon  the  belief  that  some  form  of 
matching,  either  involving  spatio-temporal  gradients  or  other  feature 
primitives,  is  essential  to  solving  the  motion  perception  problem.  This 
paradigm  is  by  no  means  inviolable,  as  has  been  shown  recently  in  [3],  for 
certain imaging situations.

The work with synthetic images served to lay the preliminary
groundwork  for  evaluating  the  proposed  matching  algorithm.  The  success  of 
this  study  showed  that  the  scheme  is  reliable  enough  to  test  on  natural 
images.  Of  course  the  heart  of  the  matter  is  to  be  able  to  determine  the 
interest points without elaborate processing. Thus experiments with natural
images were thought to be contingent upon being able to formulate and
compute feature primitives that are stable and recoverable with local
operators. The orthogonal decomposition operator described here (see also
[12]) proved adequate for the purposes of applying the clustering algorithm
on natural images. This “interest” operator is simpler than other corner
finding algorithms like the ones described in [27, 51, 62], although its
performance is comparable to the best of them ([80]). Incidentally, the
operator  described  here  was  also  used  recently  in  another  image  processing 
context  with  considerable  success  (see  [4]). 

To  summarize,  a  list  of  the  salient  advantages  of  our  algorithm  is  given 
below: 

1.  Applicable  to  multiple  moving  objects. 

2.  Good  behavior  in  the  presence  of  noise. 

3.  Automatic  segmentation  for  image  areas  projected  from  different 
moving  objects  or  parts  of  the  same  rigid  surface  differentiated  by  sharp 
depth  changes. 


4. The clustering formulation proposed is a mathematically well defined
paradigm for motion segmentation. Furthermore, the complexity of this
approach  is  well  understood  and  efficient  suboptimal  algorithms  can  be 
used. 

5.  Conceptual  simplicity  and  amenability  to  parallel  implementation. 

6.  Matching  and  segmentation  are  handled  uniformly,  under  the  same 
paradigm. 


PLATE 2.I. Example of a synthetically generated surface


CORRECT  MATCHES 
2-BODY  MOTION 


PLATE  2.IV.  Computed  two  body  motion  matches. 


PLATE  2.V.  Results  obtained  on  a  natural  image. 

(i)  computed  vectors  bottom  left 

(ii)  manually  determined  vectors  top  left 

(iii) interest points from 1st frame at bottom right


Chapter Three


Physical  Constraints  on  Image  Motion 


3.1.  Introduction 

This  chapter  establishes  a  mathematical  framework  for  investigating  the 
motion  perception  problem,  with  a  view  to  understanding  the  adequacy  of 
the resultant mathematical constraints. The reader who is knowledgeable in
the basics of the area can start with the last section of the chapter, which
summarizes  the  contribution  towards  theoretical  understanding  of  the 
mechanics  of  motion  interpretation. 

The  motion  of  a  body  can  be  characterized  by  the  rate  of  change  of  the 
positions  of  various  points  on  its  visible  surface.  Instantaneously,  this 
corresponds  to  a  three  dimensional  velocity  field.  If  the  body  (or  surface)  is 
rigid, then this velocity field can be described by the set of three
dimensional position coordinates and six global parameters (see figure 3.1),
which are:

(i) The three components of the velocity of any point O on the body.
These are called the translation parameters.

(ii) The rotational velocity components of a coordinate frame, with origin
O, attached rigidly to the body.

A  standard  result  from  kinematics  and  geometry  (see  [21])  is  that  although 
the  rotational  parameters  are  invariant  with  respect  to  the  choice  of  the 
origin  O,  of  the  body  frame,  the  translation  parameters  are  dependent  on 
the  choice  of  O. 
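
The dependence of the translation parameters on the choice of origin can be checked numerically. The sketch below assumes the velocity of a body point P is T + Ω × (P - O), with T the velocity of the reference point O and Ω the rotational velocity; the helper names and the numerical values are illustrative only.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def point_velocity(point, origin, translation, rotation):
    """Velocity of a rigid body point: T + Omega x (P - O)."""
    return add(translation, cross(rotation, sub(point, origin)))


omega = (0.0, 0.0, 2.0)    # rotational velocity, the same for any origin
origin_a = (0.0, 0.0, 0.0)
t_a = (1.0, 0.0, 0.0)      # velocity of the body point chosen as origin_a

# Moving the body frame to another body point changes only the translation,
# which becomes the velocity of the new origin.
origin_b = (1.0, 1.0, 0.0)
t_b = point_velocity(origin_b, origin_a, t_a, omega)

p = (3.0, -2.0, 5.0)
v_from_a = point_velocity(p, origin_a, t_a, omega)
v_from_b = point_velocity(p, origin_b, t_b, omega)
print("translations:", t_a, t_b)         # these differ
print("velocity of p:", v_from_a, v_from_b)  # these agree
```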

In  general,  computing  three  dimensional  motion  from  monocular  two 
dimensional  image  motion  flux  is  an  underdetermined  problem,  admitting 
an  infinite  number  of  solutions.  However,  most  of  the  moving  objects  in 
our  environment  are  rigid,  and  the  rigidity  constraint  greatly  simplifies  the 
task  of  representation  and  analysis  of  visual  motion  [86].  From  a  practical 
standpoint,  the  study  of  rigid  body  motion  is  interesting,  since  it  finds 
widespread  applications  in  the  areas  of  optical  navigation,  tracking  and 
recovery  of  3D  structure  of  rigid  objects.  The  following  analysis  explores  the 
ramifications of this central assumption in the interpretation of three
dimensional structure and motion from two dimensional image motion (see
Ullman’s paper [88] for a discussion of nonrigid motion perception).

The  previous  chapter  introduced  the  notion  of  optical  flow  as  an 
abstraction  for  image  motion.  It  is  a  fact  that,  as  yet,  it  seems  very  difficult 
to compute optical flow. However, it can be estimated by spatio-temporal
interpolation from discrete displacement measurements, when the sampling
is  “adequate”.  Under  this  adequate  sampling  assumption,  the  following 
analysis  deals  with  the  optical  flow  representation  for  image  motion. 

When  considering  motion  of  rigid  bodies,  there  are  two  cases  of 
interest,  namely,  egomotion  and  general  motion.  Egomotion  or  self-motion 
refers  to  the  movement  of  the  camera  or  sensor  in  a  static  environment. 
The image flux, or optical flow, generated by such a motion is due to a
single relative movement, i.e. between observer and static environment. In
contrast, general motion implies that there is more than one object moving
with different velocities in the observer's field of view. In this case the
optical  flow  field  consists  of  many  segments  corresponding  to  the  various 
moving  surfaces.  Each  segment  is  characterized  by  the  translational  and 
rotational  velocities  of  the  associated  moving  rigid  surface  inducing  the 
optical  flow.  These  velocities  are  called  the  parameters  of  motion  for  the 
rigid  surface. 

The rigid motion parameters are usually expressed with respect to a
frame of reference attached to the moving surface, which is assumed to
coincide with the observer's frame of reference at the time of observation.
The problem is to determine the motion parameters corresponding to an
optical flow field segment. If the depth of the scene is unknown then it can
be shown that only the rotation - which is depth invariant - can be
determined uniquely; whereas the three translation parameters can only be
determined up to a scale factor (this is the depth scaling effect). Thus we
can determine five parameters to characterize the motion in this case.
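
The depth scaling effect can be illustrated numerically. The sketch below assumes perspective projection x = fX/Z, y = fY/Z and a world point velocity T + Ω × P relative to the observer; the function names and numbers are hypothetical. Scaling both the scene point and the translation by the same factor k leaves the image velocity unchanged, so image motion alone cannot fix the magnitude of the translation.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def scale(vector, k):
    return tuple(k * component for component in vector)


def image_velocity(point, translation, rotation, focal_length=1.0):
    """Image velocity (u, v) of a world point under perspective projection."""
    X, Y, Z = point
    wx, wy, wz = cross(rotation, point)
    vx = translation[0] + wx
    vy = translation[1] + wy
    vz = translation[2] + wz
    # Time derivatives of x = f X / Z and y = f Y / Z.
    u = focal_length * (vx * Z - X * vz) / (Z * Z)
    v = focal_length * (vy * Z - Y * vz) / (Z * Z)
    return u, v


point = (1.0, 2.0, 4.0)
translation = (0.5, -0.25, 1.0)
rotation = (0.125, 0.25, -0.5)
k = 2.0

flow = image_velocity(point, translation, rotation)
flow_scaled = image_velocity(scale(point, k), scale(translation, k), rotation)
print(flow, flow_scaled)  # the two image velocities coincide
```

The rotation is left unscaled in both calls, which mirrors the statement that rotation is depth invariant while translation is recoverable only up to scale.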

The  concern  here  is  with  the  physical  constraints  that  make  it  possible  to 
compute  the  five  parameters  of  rigid  motion  and  the  structure  of  the  moving 
surface  from  retinal  stimulus  such  as  optical  flow. 

The  optical  flow  field  comprises  two  parts,  corresponding  to  the  rotation 
and  the  translation,  respectively,  of  the  inducing  motion,  and  is  constrained 
at  every  point  by  the  parameters  of  the  motion.  Motion  perception  becomes 
simpler  in  the  instances  when  the  optical  flow  field  can  be  computationally 
separated  into  the  respective  components  [39].  A  familiar  illustration  of  this 
is  the  case  of  motion  parallax  observable  at  depth  discontinuities  in  the 
retinal  field.  The  effect  is  to  reduce  the  dimensionality  of  the  space  of 
unknowns.  Unfortunately,  this  seems  to  be  very  hard  to  accomplish,  in 
general.  Motion  parallax  is  the  basis  for  an  algorithm  by  Lawton  [70]. 
Other  approaches  to  the  problem  can  be  found  in  [19,  52],  involving 
nonlinear  least  square  techniques  or  using  local  constraints  involving 
derivatives  of  the  optical  flow. 

Algorithms for rigid motion perception are difficult to design for two
main reasons:

(1) The space of parameters is of high dimensionality (e.g. five).

(2) The constraint equations obtained by optical flow measurements are
non-linear.

There  have  been  some  clever  implementations  of  non-linear  search 
algorithms  to  interpret  3D  motion  from  optical  flow  data  [67,  68].  There 
have  also  been  discrete  point  tracking  algorithms  by  Tsai  and  Huang  [84] 
and  Fang  and  Huang  [28,  29]  and  Longuet-Higgins  [53].  In  some  of  the 
latter  algorithms,  the  nonlinear  motion  equations  are  linearized  in  terms  of 
synthetic  parameters,  which  are  nonlinear  combinations  of  the  actual  motion 
parameters.  Tsai  and  Huang,  and  Fang  and  Huang,  note  the  cases  when 
such  algorithms  fail  to  compute  motion  parameters. 

An  important  aspect  of  the  following  analysis  will  be  the  examination  of 
situations  where  the  monocular  optical  flow  field  could  be  interpreted  in 
more  than  one  way.  An  instance  of  such  ambiguity  is  the  optical  flow  field 
due  to  motion  of  a  plane  [82]. 

A  geometric  analysis  of  the  problem  of  computing  3D  motion 
parameters  from  2D  image  velocities  has  been  done  by  Longuet-Higgins  and 
Prazdny  [52].  The  constraint  equations  that  they  derive  are  simple  in  form, 
but  deal  with  velocities.  To  implement  a  motion  analysis  algorithm  based 
on  these  equations,  one  makes  the  assumption  that  the  temporal  grain  of 
the  observations  is  fine  enough  to  talk  meaningfully  about  the  velocities  or 
time  derivatives  of  both  the  image  and  world  positions.  Representing 
motion by velocity parameters entails making a first order approximation of
the temporal behavior associated with the motion. Thus, for example, if the
displacement of a particle moving in one dimensional space is $\Delta x$ in
time $\Delta t$, then $\Delta x / \Delta t$ is a good approximation for the
velocity only when $\Delta t$ is small enough that the change in velocity in
this time period is small.

An  alternative  derivation  is  due  to  Tsai  and  Huang  [84].  Their 
approach  is  to  analyze  the  relation  between  the  projected  displacement 
vectors  in  the  image  plane  due  to  an  arbitrary  rigid  displacement  of  a  set  of 
points  in  3D.  It  is  known  [21]  that  this  type  of  motion  can  be  characterized 
by  a  rotation  about  an  axis  passing  through  the  origin  of  the  reference 
coordinate  frame  and  a  translation. 

The  assumptions  underlying  the  work  reported  here  are: 

(i) The motion being observed is due to a rigid surface.

(ii)  The  time  constant  (or  sampling  interval)  of  the  sensor  is  small  enough 
to  make  a  first  order  approximation  of  the  temporal  behavior  due  to  the 
motion  being  observed. 

3.2.  Review  of  related  work  in  the  analysis  of  motion  geometry 

The  computation  of  rigid  motion  parameters  from  image  displacement 
vector fields has been studied by a number of researchers. Egomotion has
been considered by Longuet-Higgins and Prazdny [52], Prazdny [67],
Waxman and Ullman [89] and Bruss and Horn [19]. Longuet-Higgins and
Prazdny examine ways of determining 3D structure and motion parameters
from optical flow, given an accurate reconstruction of the optical flow field.
They  show  that  for  non  planar  surfaces  local  analysis  of  the  flow  field  yields 
a  cubic  constraint  involving  the  motion  parameters.  Prazdny  ([67])  has 
devised  a  five  point  algorithm  to  solve  for  the  motion  parameters  from 
nonlinear  constraint  equations.  Waxman  and  Ullman’s  method  depends 
upon  reconstruction  of  the  optical  flow  field  analytically,  in  local 
neighborhoods.  Bruss  and  Horn  propose  a  least  square  solution  to  the 
parameter  estimation  problem. 

Some  other  computational  approaches  attempt  to  segment  the  optical 
flow  field  into  translational  and  rotational  components,  albeit  approximately. 
An  example  is  the  method  of  Rieger  and  Lawton  [70]  where  the  change  of 
rotational  flow  at  steep  depth  gradients,  is  treated  as  noise.  Jain  [48,  49] 
computes  the  focus  of  expansion  before  computing  the  image  displacements 
and  uses  the  former  to  guide  the  correspondence  for  finding  the  latter. 

All  the  above  methods  compute  the  motion  parameters  from  optical 
flow,  i.e.  continuous  or  differential  image  motion.  An  alternative  approach 
is  to  consider  evaluating  the  motion  parameters  and  3D  structure  from 
discrete point correspondence. Ullman [86] shows that three views of four
non coplanar points are adequate to determine the structure and motion of
these points under orthography. Tsai and Huang [84] prove that the motion
of seven points not lying on two planes, one of which passes through the
origin, nor on a cone passing through the origin, can be uniquely computed
from discrete displacements. Fang and Huang [28, 29] prove that structure
and motion of nine points not lying on a second order surface passing
through the origin are uniquely determined from image displacements. Nagel
and  Neuman  [60]  and  Roach  and  Aggarwal  [71]  have  also  looked  at  the 
problem  of  determining  motion  from  discrete  displacements. 

Yet  another  approach  to  the  problem  of  motion  parameter  computation 
has been to restrict the motion to simplify the analysis. Webb and Aggarwal
[92], Hoffman and Flinchbaugh [41] and Hoffman and Bennett [43] analyze
rigid motion with the additional assumption of fixed axis of rotation or
planarity. A major motivation for this type of analysis is that it models the
locomotion of man and animals.

3.3.  The  Geometry  of  Rigid  Motion 

Consider  a  sensor  moving  relative  to  a  static  scene.  The  coordinate 
frame  (X,Y,Z)  is  fixed  to  the  sensor  (see  figure  3.1).  The  viewing  direction 
is  along  the  positive  z-axis. 

Under orthography, the projection equation relating the position of a
point in three space P = (X,Y,Z) to its image p = (x,y) is:

(x,y) = (X,Y)

Under  perspective  projection,  the  “image”  is  formed  by  “rays”  from 
points  in  three  space  (i.e.  world  points  ).  These  rays  are  constrained  to  pass 
through a nodal point called the center of perspectivity. The imaging
geometry is shown in figure 3.2. The nodal point is O, which is also taken
as the origin of the frame of reference. An image point p = (x,y)
corresponds to the world point P = (X,Y,Z). Here the focal length of the
imaging system is f.

Figure 3.1 Representation of rigid motion parameters. P, Q and R are rigid
body points. XYZ is the reference frame. The body centered frame is at R.
The motion of R is given by the translational velocity T = (U,V,W) and the
rotational velocity Ω = (α, β, γ). The velocity of P is T + Ω × P.

The equation of the ray OP is:

$$\frac{x}{X} = \frac{y}{Y} = \frac{f}{Z}$$

surface (or a manifold) in 3 space. In other words the 3 cartesian
coordinates of a point on a rigid body are not independent. Formally,


$$B = (k, f)$$

where

$$k = \{(X,Y,Z) \mid \text{point on the surface of } B\}$$
$$f(X,Y,Z) = 0$$

When the body B is displaced with respect to the frame of reference,
we obtain a new representation

$$B' = (k', f')$$

The displacement is described by the affine transformation

$$\mathbf{X}' = [R]\,\mathbf{X} + \mathbf{T} \qquad (3.1)$$

Any displacement of a rigid body can be modeled by the above equation,
which describes a rotation about an axis through the origin and a translation
specified by the vector T.

The rotation matrix is orthonormal and its determinant is unity. Since
any matrix can be expressed uniquely as the sum of a symmetric and a skew
symmetric matrix, we have:

$$R = R_{sym} + R_{skew}$$

The axis of rotation is denoted by the unit vector (l,m,n), whose
components are the direction cosines of the axis of rotation and
$l^2 + m^2 + n^2 = 1$. The angle of rotation about this axis is $\theta$. Then, we have:

$$R_{sym} = \begin{bmatrix}
l^2 + (1 - l^2)\cos\theta & lm(1 - \cos\theta) & nl(1 - \cos\theta) \\
lm(1 - \cos\theta) & m^2 + (1 - m^2)\cos\theta & mn(1 - \cos\theta) \\
nl(1 - \cos\theta) & mn(1 - \cos\theta) & n^2 + (1 - n^2)\cos\theta
\end{bmatrix}$$

$$R_{skew} = \sin\theta \begin{bmatrix}
0 & -n & m \\
n & 0 & -l \\
-m & l & 0
\end{bmatrix}$$

If the rotation angle is small with respect to the precision of retinal
measurement, the rotation matrix can be written in terms of the three
component rotations about the individual axes [45]. In this case R and T are
given by

$$R = \begin{bmatrix}
1 & -\omega_z & \omega_y \\
\omega_z & 1 & -\omega_x \\
-\omega_y & \omega_x & 1
\end{bmatrix},
\qquad
T = \begin{bmatrix} t_x \\ t_y \\ t_z \end{bmatrix}$$

where $\omega_x = l\theta$, $\omega_y = m\theta$ and $\omega_z = n\theta$. Substituting for R and T in equation
(3.1) we have,

$$X' = X - \omega_z Y + \omega_y Z + t_x \qquad (3.2.1)$$
$$Y' = Y + \omega_z X - \omega_x Z + t_y \qquad (3.2.2)$$
$$Z' = Z - \omega_y X + \omega_x Y + t_z \qquad (3.2.3)$$

$$\Delta X = t_x - \omega_z Y + \omega_y Z \qquad (3.3.1)$$
$$\Delta Y = t_y + \omega_z X - \omega_x Z \qquad (3.3.2)$$
$$\Delta Z = t_z - \omega_y X + \omega_x Y \qquad (3.3.3)$$

where,

$$\Delta X = X' - X, \qquad \Delta Y = Y' - Y, \qquad \Delta Z = Z' - Z$$
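
The quality of the small-angle approximation can be checked numerically. The sketch below builds the exact rotation matrix for an axis (l, m, n) and angle θ as the sum of its symmetric and skew symmetric parts, and compares it with the first order matrix whose off-diagonal entries are the component rotations ω_x = lθ, ω_y = mθ, ω_z = nθ. Function names and test values are illustrative assumptions.

```python
import math


def exact_rotation(axis, theta):
    """Rotation by theta about the unit vector axis = (l, m, n)."""
    l, m, n = axis
    c, s = math.cos(theta), math.sin(theta)
    k = 1.0 - c
    # Each entry is the symmetric part plus sin(theta) times the skew part.
    return [
        [l * l + (1 - l * l) * c, l * m * k - n * s, n * l * k + m * s],
        [l * m * k + n * s, m * m + (1 - m * m) * c, m * n * k - l * s],
        [n * l * k - m * s, m * n * k + l * s, n * n + (1 - n * n) * c],
    ]


def small_angle_rotation(axis, theta):
    """First order approximation with omega = theta * axis."""
    wx, wy, wz = (theta * component for component in axis)
    return [
        [1.0, -wz, wy],
        [wz, 1.0, -wx],
        [-wy, wx, 1.0],
    ]


def max_abs_difference(a, b):
    return max(abs(a[i][j] - b[i][j]) for i in range(3) for j in range(3))


axis = (2 / 3, 1 / 3, 2 / 3)  # unit vector: l^2 + m^2 + n^2 = 1
for theta in (0.01, 0.1, 0.5):
    error = max_abs_difference(exact_rotation(axis, theta),
                               small_angle_rotation(axis, theta))
    print(f"theta = {theta}: largest entry error = {error:.2e}")
```

The error grows roughly with the square of the angle, which is the sense in which the approximation is first order.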



We  define  the  parameter  vector  a  for  characterizing  the  motion,  where 

Motion  perception  involves  the  recovery  of  the  parameters  of  motion,  as 
well  as  the  structure  (or  shape)  of  the  moving  object.  The  geometric 
properties  of  the  three  dimensional  surfaces  and  points  are  related  to  the 
geometry  of  their  image.  Thus  the  projective  transformation  involved  in  the 
image  formation  process  must  be  analyzed.  The  subsequent  analysis 
considers  both  the  cases  of  perspective  as  well  as  orthographic  projections. 

3.4.  Motion  under  Orthography 

When  the  model  of  image  formation  involves  orthographic  or  parallel 
projection,  then  the  mathematical  formulation  of  the  problem  becomes 
considerably  simpler.  It  can  be  argued  that  this  is  a  valid  model  of  image 
formation  when  viewing  distant  objects,  or  when  the  focal  length  of  the 
camera  is  large  compared  to  the  distance  of  the  viewed  surfaces,  or  when 
the  viewing  area  is  small  and  centered  around  the  line  of  sight  -  as  in  the 
case of the field of view corresponding to the fovea in the retina. Consider
an image point p = (x,y) projected by the world point P = (X,Y,Z).
Assuming that after a short while the point moves to a position given by
P' = (X',Y',Z') while its image moves to p' = (x',y'), the following relations are
obtained from equations (3.3):

$$\Delta x = x' - x = \Delta X = t_x - \omega_z Y + \omega_y Z$$
$$\Delta y = y' - y = \Delta Y = t_y + \omega_z X - \omega_x Z \qquad (3.4)$$

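Optical flow is the time derivative of the image position vector and is
denoted by (u,v) where

$$(u,v) = (\dot{x},\dot{y}) = (\dot{X},\dot{Y})$$

Alternatively,

$$u = \lim_{\Delta t \to 0} \frac{\Delta x}{\Delta t} = \frac{dx}{dt}, \qquad
v = \lim_{\Delta t \to 0} \frac{\Delta y}{\Delta t} = \frac{dy}{dt}$$

The motion parameters are now the translational velocity $V_T = (U,V,W)$ and
the rotational velocity $\Omega = (\alpha,\beta,\gamma)$ where:

$$U = \lim_{\Delta t \to 0} \frac{t_x}{\Delta t}, \qquad
V = \lim_{\Delta t \to 0} \frac{t_y}{\Delta t}, \qquad
W = \lim_{\Delta t \to 0} \frac{t_z}{\Delta t}$$

$$\alpha = \lim_{\Delta t \to 0} \frac{\omega_x}{\Delta t}, \qquad
\beta = \lim_{\Delta t \to 0} \frac{\omega_y}{\Delta t}, \qquad
\gamma = \lim_{\Delta t \to 0} \frac{\omega_z}{\Delta t}$$

therefore the equations relating image and 3D motion are

$$u = U + \beta Z - \gamma y$$
$$v = V - \alpha Z + \gamma x \qquad (3.5)$$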

These equations are identical in form to those obtained in the
discrete case (assuming small rotation), i.e. equation (3.3). Strictly
speaking, according to the nomenclature adopted before, the motion
parameters for the discrete case are $(t_x, t_y, \omega_x, \omega_y, \omega_z)$ and those for the
differential case are $(U, V, \alpha, \beta, \gamma)$. However, since equations (3.3) and (3.5)
are identical in form, all subsequent analysis is based on the latter equation.
Furthermore, the parameters (it will be evident later that only the
rotational parameters are of interest here), in both the differential as well as
the discrete cases, will be referred to by the symbols $(\alpha, \beta, \gamma)$. The treatment
of both cases is identical, the only difference being that derivatives in
the differential analysis correspond to differences in the discrete case.
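
As a concrete reading of the discrete relations (3.4) and the differential relations (3.5), the following minimal Python sketch evaluates both forms under orthography. The function names and the sample numbers are illustrative assumptions and do not come from the original analysis.

```python
def orthographic_displacement(X, Y, Z, tx, ty, wx, wy, wz):
    """Discrete image displacement (dx, dy), small-rotation form of eq. (3.4)."""
    dx = tx - wz * Y + wy * Z
    dy = ty + wz * X - wx * Z
    return dx, dy


def orthographic_flow(X, Y, Z, U, V, alpha, beta, gamma):
    """Differential image velocity (u, v), eq. (3.5); under orthography x = X, y = Y."""
    u = U + beta * Z - gamma * Y
    v = V - alpha * Z + gamma * X
    return u, v


# Hypothetical world point and motion, chosen only for illustration.
point = (1.0, 2.0, 3.0)
u, v = orthographic_flow(*point, U=0.5, V=-0.5, alpha=0.1, beta=0.2, gamma=0.3)

# Scaling the velocities by a time step dt gives the discrete parameters.
# Both forms are linear, so displacement / dt reproduces the flow.
dt = 1e-3
dx, dy = orthographic_displacement(*point, 0.5 * dt, -0.5 * dt,
                                   0.1 * dt, 0.2 * dt, 0.3 * dt)
```

Because the two sets of equations have the same form, the same routine could serve both cases; they are kept separate here only to mirror the nomenclature of the text.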

3.4.1.  On  the  information  available  in  the  optical  flow  field 

Observe from equation (3.5) that the image displacement (or image
motion field) consists of a translational part and a rotational part. The
translational motion parameters depend on the origin of reference. In
fact, the parameters intrinsic to the motion are those of rotation. Thus,
relative to a particular point, say the origin $(0, 0)$, equation (3.5) becomes:

$$u = \beta Z - \gamma y$$
$$v = -\alpha Z + \gamma x \tag{3.6}$$

where $u$ actually means $u - u(0,0)$, $v$ is $v - v(0,0)$ and $Z$ is $Z - Z(0,0)$. It
should be emphasized here that $Z$ denotes depth relative to a certain point of
reference (in this case the origin). If the structure or relative depth is
not known then the parameters $(\alpha, \beta, \gamma)$ are not completely recoverable.
There is an exact analog of equation (3.6) for the discrete case, obtainable
from equation (3.3).

Proposition I. When the depth function (or structure) is non planar the
following parameters are uniquely determined from the image displacement
field:

(1) The rotation about the axis aligned with the line of sight, i.e. $\gamma$.

(2) The ratio of the other two parameters, i.e. $\alpha/\beta$.

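Proof: The proof is by contradiction. Consider the motion of the non planar
surface $Z_1$, which is described by the parameters $(\alpha_1, \beta_1, \gamma_1)$. The image
motion equations (from equation (3.6)) are:

$$u = \beta_1 Z_1 - \gamma_1 y$$
$$v = -\alpha_1 Z_1 + \gamma_1 x \tag{3.7}$$

Now suppose there is another surface $Z_2$, whose motion is characterized by
the parameters $(\alpha_2, \beta_2, \gamma_2)$, such that the image motion field in both cases
is the same. The motion equation for the second surface is:

$$u = \beta_2 Z_2 - \gamma_2 y$$
$$v = -\alpha_2 Z_2 + \gamma_2 x \tag{3.8}$$

Suppose, furthermore, that the following relations hold:

$$\Delta\gamma = \gamma_2 - \gamma_1 \neq 0, \qquad \frac{\alpha_1}{\beta_1} \neq \frac{\alpha_2}{\beta_2} \tag{3.9}$$

From equations (3.7) and (3.8) the following relations are obtained:

$$\beta_2 Z_2 - \beta_1 Z_1 - \Delta\gamma\, y = 0$$
$$-\alpha_2 Z_2 + \alpha_1 Z_1 + \Delta\gamma\, x = 0 \tag{3.10}$$

Now since $\alpha_1 \beta_2 \neq \alpha_2 \beta_1$, $Z_2$ can be eliminated to give:

$$Z_1 = \frac{-\alpha_2 \Delta\gamma\, y + \beta_2 \Delta\gamma\, x}{\alpha_2 \beta_1 - \alpha_1 \beta_2}$$

This expression is linear in $x$ and $y$, which is contrary to the assumption
that $Z_1$ is non planar. Therefore:

$$\frac{\alpha_1}{\beta_1} = \frac{\alpha_2}{\beta_2}$$

Again, this implies (considering equation (3.10)) that

$$\Delta\gamma = 0, \quad \text{i.e.} \quad \gamma_1 = \gamma_2$$

This completes the proof of Proposition I.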
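
The reason only the ratio $\alpha/\beta$ is recoverable can be illustrated numerically: multiplying $\alpha$ and $\beta$ by a constant while dividing the relative depth by the same constant leaves the flow of equation (3.6) unchanged. The sketch below is only an illustration; the depth function `relative_depth` and all numerical values are hypothetical.

```python
def flow_rel_origin(x, y, Z, alpha, beta, gamma):
    """Orthographic flow relative to the origin, eq. (3.6)."""
    return beta * Z - gamma * y, -alpha * Z + gamma * x


def relative_depth(x, y):
    """Hypothetical non-planar relative depth with Z(0, 0) = 0."""
    return x * x + 0.5 * x * y - y * y


def scaled_interpretation(alpha, beta, gamma, c):
    """A rotation with the same alpha/beta ratio and the same gamma."""
    return c * alpha, c * beta, gamma


alpha, beta, gamma, c = 0.2, -0.1, 0.3, 2.5
samples = [(0.5, 0.25), (-1.0, 0.75), (0.3, -0.6)]
max_diff = 0.0
for x, y in samples:
    Z = relative_depth(x, y)
    u1, v1 = flow_rel_origin(x, y, Z, alpha, beta, gamma)
    a2, b2, g2 = scaled_interpretation(alpha, beta, gamma, c)
    u2, v2 = flow_rel_origin(x, y, Z / c, a2, b2, g2)  # depth shrunk by c
    max_diff = max(max_diff, abs(u1 - u2), abs(v1 - v2))
```

Here `max_diff` stays at rounding level, so the two interpretations cannot be told apart from the flow alone, whereas $\gamma$ and the ratio $\alpha/\beta$ are common to both.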


Proposition II. The image displacement field generated by a planar surface
is linear in the arguments $(x, y)$. In addition, the parameters $\alpha/\beta$ and $\gamma$ are
uniquely determined by the image displacement field if and only if
$\alpha p + \beta q = 0$, where $(p, q)$ is the gradient of the planar surface.

Proof: Consider the equation of the planar surface $Z(x, y)$:

$$Z = px + qy + d$$

If the motion of the surface is characterized by the parameters $(\alpha, \beta, \gamma)$, the
image motion (or optical flow) relative to the origin is given by:

$$u = \beta(px + qy) - \gamma y$$
$$v = -\alpha(px + qy) + \gamma x \tag{3.11}$$

The above equation indicates that for planar surfaces the optical flow is
linear. It is also true that when the optical flow is linear then the moving
surface is planar. Now let a second interpretation of the same flow be a
plane with gradient $(p', q')$ moving with parameters $(\alpha', \beta', \gamma')$. Considering
equation (3.6), substituting for $(u, v)$ from equation (3.11) and rearranging terms:

$$0 = (\beta' p' - \beta p)x + (\beta' q' - \beta q - \gamma' + \gamma)y$$
$$0 = (-\alpha' p' + \alpha p + \gamma' - \gamma)x + (-\alpha' q' + \alpha q)y \tag{3.12}$$

Since the above equations are valid for the entire image we have:

$$\beta' p' = \beta p \tag{3.13.1}$$
$$\beta' q' - \gamma' = \beta q - \gamma \tag{3.13.2}$$
$$\alpha' p' - \gamma' = \alpha p - \gamma \tag{3.13.3}$$
$$\alpha' q' = \alpha q \tag{3.13.4}$$

Eliminating $p'$, $q'$ and $\gamma'$ from the above:

$$\beta p \, \frac{\alpha'}{\beta'} - \alpha q \, \frac{\beta'}{\alpha'} = -(\beta q - \alpha p)$$

or

$$\rho^2 \beta p + \rho(\beta q - \alpha p) - \alpha q = 0$$

where $\rho = \alpha'/\beta'$. The above quadratic equation has a unique solution if and
only if:

$$(\beta q - \alpha p)^2 + 4\alpha\beta pq = (\beta q + \alpha p)^2 = 0$$

Under this condition:

$$\frac{\alpha'}{\beta'} = \frac{\alpha}{\beta}, \qquad \gamma' = \gamma, \qquad \frac{p'}{q'} = \frac{p}{q}$$

Therefore the image motion of planar surfaces uniquely determines the
parameters $(\alpha/\beta, \gamma, p/q)$ if and only if $\alpha p + \beta q = 0$.
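
The planar ambiguity of Proposition II can be made concrete. Under the reading that the quadratic in $\rho = \alpha'/\beta'$ has the two roots $\alpha/\beta$ and $-q/p$, the second root yields an alternative plane and rotation with the same linear flow. The construction in `dual_interpretation` below is derived here from relations (3.13) as an illustration; it is not stated in the original text, and the numerical values are hypothetical.

```python
def planar_flow(x, y, alpha, beta, gamma, p, q):
    """Flow of the plane Z = p*x + q*y (relative to the origin), eq. (3.11)."""
    Z = p * x + q * y
    return beta * Z - gamma * y, -alpha * Z + gamma * x


def dual_interpretation(alpha, beta, gamma, p, q, t=1.0):
    """Second solution of (3.13), built on the root rho = -q/p of the quadratic.

    Returns (alpha', beta', gamma', p', q'); t is a free non-zero scale,
    reflecting that only alpha'/beta' and p'/q' are constrained.
    """
    return (-q * t, p * t, gamma - (alpha * p + beta * q), beta / t, -alpha / t)


alpha, beta, gamma, p, q = 1.0, 2.0, 0.5, 0.3, -0.4   # alpha*p + beta*q = -0.5
dual = dual_interpretation(alpha, beta, gamma, p, q)

max_diff = 0.0
for x, y in [(0.1, 0.2), (-0.7, 0.4), (1.5, -2.0)]:
    u1, v1 = planar_flow(x, y, alpha, beta, gamma, p, q)
    u2, v2 = planar_flow(x, y, *dual)
    max_diff = max(max_diff, abs(u1 - u2), abs(v1 - v2))
```

When $\alpha p + \beta q = 0$ the two roots coincide, `dual` has the same $\gamma$ and the same ratios $\alpha/\beta$ and $p/q$ as the original motion, and the interpretation is unique in the sense of the proposition.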


3.4.2.  Summary  for  the  case  of  orthographic  projection 
What we have shown is that:

(1) The analyses under orthographic projection for differential and
discrete motion are nearly identical.

(2) When the structure of the moving object is known, the motion
parameters can be computed uniquely from image motion.

(3) When the structure is not known then the recoverable parameters are
$(\alpha/\beta, \gamma, p/q)$. However, in this case the values are unique only when the
moving surface is non planar, or a certain condition (see Proposition II)
holds.


3.5. Analysis of Rigid Motion for the Perspective Projection Model


Recall that the projective relation between image and world coordinates
for a point $P$ is given by

$$(x, y) = \left( f\frac{X}{Z},\; f\frac{Y}{Z} \right) \tag{3.14}$$

The above projection is denoted by $(X, Y, Z) \to (x, y, f)$. Similarly, the
projective relation between another world point $P'$ and its image is
$(X', Y', Z') \to (x', y', f)$. Following the schema used for orthographic projection
we proceed to derive the equations of motion from first principles. Thus
from equation (3.3) we have,

$$\Delta x = x' - x = f\left( \frac{X'}{Z'} - \frac{X}{Z} \right)$$
$$\Delta y = y' - y = f\left( \frac{Y'}{Z'} - \frac{Y}{Z} \right)$$

or,

$$\Delta x = f\, \frac{Z\,\Delta X - X\,\Delta Z}{Z(Z + \Delta Z)} \tag{3.15.1}$$
$$\Delta y = f\, \frac{Z\,\Delta Y - Y\,\Delta Z}{Z(Z + \Delta Z)} \tag{3.15.2}$$

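Recall that when the 3D rotation angles characterizing the rigid motion are
"small", the 3D displacement components are given by the relations:

$$\Delta X = t_x - \omega_z Y + \omega_y Z$$
$$\Delta Y = t_y + \omega_z X - \omega_x Z$$
$$\Delta Z = t_z - \omega_y X + \omega_x Y$$

Thus, substituting for $\Delta X$, $\Delta Y$ and $\Delta Z$ in equation (3.15) we obtain an
expression for the components of the retinal displacement,

$$\Delta x = f\, \frac{Z(t_x + \omega_y Z - \omega_z Y) - X(t_z + \omega_x Y - \omega_y X)}{Z^2 + Z(t_z + \omega_x Y - \omega_y X)}$$

or,

$$\Delta x = \frac{(f t_x - x t_z)/Z + f\omega_y - \omega_z y - \omega_x \frac{xy}{f} + \omega_y \frac{x^2}{f}}{1 + \frac{t_z}{Z} + \omega_x \frac{y}{f} - \omega_y \frac{x}{f}}$$

similarly,

$$\Delta y = \frac{(f t_y - y t_z)/Z - f\omega_x + \omega_z x - \omega_x \frac{y^2}{f} + \omega_y \frac{xy}{f}}{1 + \frac{t_z}{Z} + \omega_x \frac{y}{f} - \omega_y \frac{x}{f}}$$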

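The above equations express the retinal displacement vector $(\Delta x, \Delta y)$ at
an image point $p = (x, y)$ in terms of the parameter vector a and the depth
coordinate $Z$ of the corresponding world point $P = (X, Y, Z)$. Another form of
the above equations is,

$$\Delta x = \frac{(x_0 - x)\frac{t_z}{Z} + f\omega_y - \omega_z y - \omega_x \frac{xy}{f} + \omega_y \frac{x^2}{f}}{1 + \frac{t_z}{Z} + \omega_x \frac{y}{f} - \omega_y \frac{x}{f}} \tag{3.16.1}$$

$$\Delta y = \frac{(y_0 - y)\frac{t_z}{Z} - f\omega_x + \omega_z x - \omega_x \frac{y^2}{f} + \omega_y \frac{xy}{f}}{1 + \frac{t_z}{Z} + \omega_x \frac{y}{f} - \omega_y \frac{x}{f}} \tag{3.16.2}$$

where,

$$(x_0, y_0) = \left( f\frac{t_x}{t_z},\; f\frac{t_y}{t_z} \right)$$

Note that, when the displacement is purely translational

$$\frac{\Delta y}{\Delta x} = \frac{y_0 - y}{x_0 - x} \tag{3.17}$$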


This means that when the rotational component of the displacement is zero,
the retinal displacement field converges to or diverges from a single point
$(x_0, y_0)$ in the image plane. This point is called the focus of contraction
(FOC) or the focus of expansion (FOE), depending on whether the
translational motion is directed away from or towards the image plane
(figure 3.3).
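
Relation (3.17) can be checked directly from the projection equation (3.14), without the small-rotation approximation, since a pure translation involves no rotation at all. The following Python sketch projects a few hypothetical world points before and after a translation and tests that each displacement vector is aligned with the line joining the image point to the FOE. The helper names and the numbers are assumptions made for illustration.

```python
def project(P, f=1.0):
    """Perspective projection of eq. (3.14)."""
    X, Y, Z = P
    return f * X / Z, f * Y / Z


def translational_displacement(P, t, f=1.0):
    """Image point of P and its exact displacement when P is translated by t."""
    x, y = project(P, f)
    x2, y2 = project(tuple(c + d for c, d in zip(P, t)), f)
    return (x, y), (x2 - x, y2 - y)


def focus_of_expansion(t, f=1.0):
    """(x0, y0) = (f*tx/tz, f*ty/tz); requires tz != 0."""
    tx, ty, tz = t
    return f * tx / tz, f * ty / tz


t = (0.2, -0.1, 0.5)
foe = focus_of_expansion(t)
residuals = []
for P in [(1.0, 2.0, 5.0), (-0.5, 0.3, 4.0), (2.0, -1.0, 8.0)]:
    (x, y), (dx, dy) = translational_displacement(P, t)
    # Cross-multiplied form of eq. (3.17): zero when the vectors are aligned.
    residuals.append(dx * (foe[1] - y) - dy * (foe[0] - x))
```

The cross-multiplied form is used so that points lying on the vertical line through the FOE do not cause a division by zero.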

From the retinal displacement field due to a particular motion, it is
possible to estimate the parameters characterizing the motion. Suppose, in
addition, that the temporal sampling rate of the imaging process is high,
meaning that the components of the displacement over a single time interval
are small and the variable terms in the denominator of equation (3.16) are
small compared to unity, i.e.

$$\left| \frac{t_z}{Z} + \omega_x \frac{y}{f} - \omega_y \frac{x}{f} \right| \ll 1 \tag{A}$$

Given these assumptions, it is possible to derive the equations relating
image motion to the motion parameters in the differential case. This is
obtained by dividing equation (3.16) by a small time interval, $\Delta t$, and taking
the limit as $\Delta t \to 0$. The image displacement then becomes image velocity,
and is called optical flow. The optical flow is denoted by the vector $(u, v)$
where:

$$u = \lim_{\Delta t \to 0} \frac{\Delta x}{\Delta t} = \frac{dx}{dt}, \qquad v = \lim_{\Delta t \to 0} \frac{\Delta y}{\Delta t} = \frac{dy}{dt}$$

Similarly the motion parameters are now the translational velocity
$T = (U, V, W)$ and the rotational velocity $\Omega = (\alpha, \beta, \gamma)$ where:

$$U = \lim_{\Delta t \to 0} \frac{t_x}{\Delta t}, \qquad V = \lim_{\Delta t \to 0} \frac{t_y}{\Delta t}, \qquad W = \lim_{\Delta t \to 0} \frac{t_z}{\Delta t}$$
$$\alpha = \lim_{\Delta t \to 0} \frac{\omega_x}{\Delta t}, \qquad \beta = \lim_{\Delta t \to 0} \frac{\omega_y}{\Delta t}, \qquad \gamma = \lim_{\Delta t \to 0} \frac{\omega_z}{\Delta t}$$

Equation (3.16) now becomes:

$$u = (x_0 - x)\frac{W}{Z} + f\beta - \gamma y - \alpha \frac{xy}{f} + \beta \frac{x^2}{f} \tag{3.18.1}$$
$$v = (y_0 - y)\frac{W}{Z} - f\alpha + \gamma x - \alpha \frac{y^2}{f} + \beta \frac{xy}{f} \tag{3.18.2}$$

where the 3D motion is now characterized by a translational velocity
$(U, V, W)$ and a rotational velocity $(\alpha, \beta, \gamma)$. Furthermore the FOE is now
given by $(x_0, y_0) = \left( f\frac{U}{W},\; f\frac{V}{W} \right)$.

Motion perception involves the computation of the parameters of
motion from the image displacement field. The latter becomes, in the
limiting case, a field of velocities called optical flow. The relation that
optical flow has with the motion parameters is embodied in equations
(3.18). These motion equations involve velocities, both in 3D and on
the retina. However, in a practical vision system, the retinal measurements
that are actually made involve displacements over a small time interval.
Thus the above velocity equations are not strictly applicable, but under
certain conditions the penalty paid for using them may not be too severe.
This happens when the error introduced by the velocity approximation is
sufficiently small.

There are two separate approximations embodied in the use of
equations (3.18) to express the constraints on image motion due to the 3D
motion parameters:

(1) The three dimensional velocity approximation - The velocity of a point
$P = (X, Y, Z)$ on a rigid body moving with a translational velocity
$T = (U, V, W)$ and a rotational velocity $\Omega = (\alpha, \beta, \gamma)$ is given by

$$\frac{dP}{dt} = T + \Omega \times P$$

Integrating the above with respect to time we have

$$\Delta P = \int_{t}^{t + \Delta t} (T + \Omega \times P)\, dt$$

Here $\times$ denotes the vector cross product. The three dimensional
velocity approximation implies that, for small $\Delta t$, the three dimensional
displacement can be expressed as:

$$\Delta P = (\Delta X, \Delta Y, \Delta Z) \approx T\,\Delta t + (\Omega\,\Delta t) \times P$$

(2) The retinal velocity approximation - This enables us to treat retinal
displacements as retinal velocities and is valid so long as $|\Delta Z / Z| \ll 1$. This
can also be written as relation (A) stated previously.
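
The size of the error introduced by the retinal velocity approximation can be explored numerically. The sketch below, written under the assumption $f = 1$ and with hypothetical motion values, compares the displacement of equation (3.16) divided by the time step with the flow of equation (3.18), and shows the gap shrinking as the time step decreases.

```python
def retinal_displacement(x, y, Z, t, w, f=1.0):
    """Retinal displacement (dx, dy) of eq. (3.16) for a small discrete motion."""
    tx, ty, tz = t
    wx, wy, wz = w
    denom = 1.0 + tz / Z + wx * y / f - wy * x / f
    dx = ((f * tx - x * tz) / Z + f * wy - wz * y
          - wx * x * y / f + wy * x * x / f) / denom
    dy = ((f * ty - y * tz) / Z - f * wx + wz * x
          - wx * y * y / f + wy * x * y / f) / denom
    return dx, dy


def optical_flow(x, y, Z, T, Omega, f=1.0):
    """Optical flow (u, v) of eq. (3.18), using (x0 - x)W = fU - xW."""
    U, V, W = T
    a, b, g = Omega
    u = (f * U - x * W) / Z + f * b - g * y - a * x * y / f + b * x * x / f
    v = (f * V - y * W) / Z - f * a + g * x - a * y * y / f + b * x * y / f
    return u, v


def velocity_approximation_error(dt, x, y, Z, T, Omega):
    """Largest gap between displacement / dt and the flow for time step dt."""
    t = tuple(c * dt for c in T)
    w = tuple(c * dt for c in Omega)
    dx, dy = retinal_displacement(x, y, Z, t, w)
    u, v = optical_flow(x, y, Z, T, Omega)
    return max(abs(dx / dt - u), abs(dy / dt - v))


# Hypothetical image point, depth and motion.
args = (0.2, -0.1, 4.0, (0.3, -0.2, 0.5), (0.1, -0.05, 0.2))
coarse = velocity_approximation_error(1e-2, *args)
fine = velocity_approximation_error(1e-4, *args)
```

The only difference between the two expressions is the denominator of (3.16), so the error is governed by the quantity bounded in relation (A).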

When both the translational velocity $T$ and the depth function $Z$
are multiplied by the same constant, the constant cancels out, leaving
equations (3.18) unchanged. The same applies to equations (3.16).
This means that scaling the translation by a constant factor and, at the same
time, dilating the depth by the same factor leaves the image
displacement field unchanged. Thus, from the information available in the
image displacement field, the translation vector is determined only up to a
scale factor.

In equation (3.16) the depth variable $Z$ is an unknown. An equation
relating image displacement to the motion parameters is obtained by
eliminating $t_z / Z$ from equations (3.16):

$$\frac{\Delta x \left( 1 + \omega_x \frac{y}{f} - \omega_y \frac{x}{f} \right) - \left( f\omega_y - \omega_z y - \omega_x \frac{xy}{f} + \omega_y \frac{x^2}{f} \right)}{\Delta y \left( 1 + \omega_x \frac{y}{f} - \omega_y \frac{x}{f} \right) - \left( -f\omega_x + \omega_z x - \omega_x \frac{y^2}{f} + \omega_y \frac{xy}{f} \right)} = \frac{x_0 - x - \Delta x}{y_0 - y - \Delta y}$$

or,

$$\frac{\omega_x A_1 - \omega_y A_2 + \omega_z y f + f \Delta x}{\omega_x A_3 - \omega_y A_4 - \omega_z x f + f \Delta y} = \frac{x_0 - x - \Delta x}{y_0 - y - \Delta y} \tag{3.19}$$

where

$$A_1 = y \Delta x + xy, \qquad A_2 = f^2 + x \Delta x + x^2$$
$$A_3 = f^2 + y \Delta y + y^2, \qquad A_4 = x \Delta y + xy$$

The above equation relates the motion parameters to the image
displacements, which are observables. This is a bilinear equation in the
unknown motion parameters. A similar relation is obtained for the
differential motion case, by eliminating $W / Z$ from equation (3.18):

$$\frac{u - \left( f\beta - \gamma y - \alpha \frac{xy}{f} + \beta \frac{x^2}{f} \right)}{v - \left( -f\alpha + \gamma x - \alpha \frac{y^2}{f} + \beta \frac{xy}{f} \right)} = \frac{x_0 - x}{y_0 - y} \tag{3.20}$$
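
Two properties stated above lend themselves to a quick numerical check: the flow of equation (3.18) is unchanged when translation and depth are scaled together, and the constraint (3.20) holds whatever the unknown depth is. The Python sketch below illustrates both, assuming $f = 1$; the function names and sample values are hypothetical.

```python
def rotational_flow(x, y, Omega, f=1.0):
    """Depth-independent (rotational) part of eq. (3.18)."""
    a, b, g = Omega
    ru = f * b - g * y - a * x * y / f + b * x * x / f
    rv = -f * a + g * x - a * y * y / f + b * x * y / f
    return ru, rv


def optical_flow(x, y, Z, T, Omega, f=1.0):
    """Full optical flow of eq. (3.18), using (x0 - x)W = fU - xW."""
    U, V, W = T
    ru, rv = rotational_flow(x, y, Omega, f)
    return (f * U - x * W) / Z + ru, (f * V - y * W) / Z + rv


def constraint_residual(x, y, u, v, foe, Omega, f=1.0):
    """Cross-multiplied eq. (3.20); vanishes when (foe, Omega) explain (u, v)."""
    x0, y0 = foe
    ru, rv = rotational_flow(x, y, Omega, f)
    return (u - ru) * (y0 - y) - (v - rv) * (x0 - x)


# Hypothetical motion and image point, for illustration only.
T, Omega = (0.3, -0.2, 0.5), (0.1, -0.05, 0.2)
x, y = 0.2, -0.1
foe = (T[0] / T[2], T[1] / T[2])          # FOE for f = 1

# (a) Scaling T and Z by the same factor c leaves the flow unchanged.
c = 3.0
flow_a = optical_flow(x, y, 4.0, T, Omega)
flow_b = optical_flow(x, y, 4.0 * c, tuple(c * t for t in T), Omega)
scale_gap = max(abs(p - q) for p, q in zip(flow_a, flow_b))

# (b) The depth-free constraint holds for any value of the depth.
residuals = [constraint_residual(x, y, *optical_flow(x, y, Z, T, Omega), foe, Omega)
             for Z in (2.0, 4.0, 10.0)]
```

A residual of this kind, summed over many image points, is one plausible starting point for a search over the five observable parameters, although the text does not prescribe that particular formulation.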


In the above analysis, the relations between image motion and 3D
motion have been derived by assuming general displacement of a rigid
constellation of points in space. This relation is given by equation (3.16).
From this, by taking the limiting case of infinitesimal displacement, the
continuous or differential motion case, equation (3.18), is obtained. The
latter relation can also be obtained directly from the kinematic equations of
rigid motion (see Appendix A or [52] for details).


3.5.1.  The  Information  available  in  the  image  displacement  field 

The foregoing analysis illustrates the dependence of the optical flow
field on the motion parameters. In other words, 3D motion constrains image
motion. The magnitude of the translation parameter vector cannot be
computed from the optical flow field. The rigid motion parameters
observable from monocular retinal optical flow measurements are given by
the parameter set $(x_0, y_0, \omega_x, \omega_y, \omega_z)$. Now we examine the motion equations
to see whether the displacement field uniquely determines the parameter
set.


The  question  to  be  answered,  before  attempting  the  design  of 
algorithms  to  compute  the  motion  parameters  from  optical  flow  is  whether 
such  computation  is  feasible.  This  means  that  given  an  optical  flow  field, 
when  can  we  say  that  it  could  be  produced  by  a  unique  set  of  motion 
parameters.  The  following  theorem  answers  this  question,  by  giving  a 
sufficient  condition  for  uniqueness. 


Theorem I: The optical flow field uniquely determines the rigid motion
parameters when the moving surface cannot be expressed as a rational function of
the form $\frac{P_1(x, y)}{Q_2(x, y)}$, where $P_1$ and $Q_2$ are polynomials of the first and second
orders respectively, and $(x, y)$ are image coordinates.

Proof: Let a rigid surface $Z'$ moving with translational and rotational
velocities $(U', V', W')$ and $(\alpha', \beta', \gamma')$ respectively generate the optical flow field
$(u, v)$ given by

$$u = \frac{U' - xW'}{Z'} + f\beta' - \gamma' y - \alpha' \frac{xy}{f} + \beta' \frac{x^2}{f}$$
$$v = \frac{V' - yW'}{Z'} - f\alpha' + \gamma' x - \alpha' \frac{y^2}{f} + \beta' \frac{xy}{f} \tag{3.21}$$

Assume that there is another surface $Z(x, y)$ moving with a different set of
motion parameters but giving rise to the same optical flow field $(u, v)$, or

$$u = \frac{U - xW}{Z} + f\beta - \gamma y - \alpha \frac{xy}{f} + \beta \frac{x^2}{f}$$
$$v = \frac{V - yW}{Z} - f\alpha + \gamma x - \alpha \frac{y^2}{f} + \beta \frac{xy}{f} \tag{3.22}$$

where the 3D motion is now due to a translational velocity $(U, V, W)$ and a
rotational velocity $(\alpha, \beta, \gamma)$.

From equations (3.21) and (3.22):

$$\frac{U - xW}{Z} - \frac{U' - xW'}{Z'} + f\Delta\beta - y\Delta\gamma - \frac{xy}{f}\Delta\alpha + \frac{x^2}{f}\Delta\beta = 0 \tag{3.23.1}$$
$$\frac{V - yW}{Z} - \frac{V' - yW'}{Z'} - f\Delta\alpha + x\Delta\gamma - \frac{y^2}{f}\Delta\alpha + \frac{xy}{f}\Delta\beta = 0 \tag{3.23.2}$$

where $\Delta\alpha = \alpha - \alpha'$, $\Delta\beta = \beta - \beta'$, and $\Delta\gamma = \gamma - \gamma'$.

Eliminating $Z$ and solving for the variable $Z'$ we have (assuming the focal
length $f$ to be unity):

$$Z' = \frac{P_1(x, y)}{Q_2(x, y)}$$

where

$$P_1(x, y) = (U'V - UV') + x(V'W - VW') + y(UW' - U'W) \tag{3.24}$$

and,

$$Q_2(x, y) = (\Delta\beta V + \Delta\alpha U) - x(\Delta\alpha W + \Delta\gamma U) - y(\Delta\beta W + \Delta\gamma V) - xy(\Delta\alpha V + \Delta\beta U) + x^2(\Delta\beta V + \Delta\gamma W) + y^2(\Delta\alpha U + \Delta\gamma W) \tag{3.25}$$

Equations (3.24) and (3.25) imply that the surface $Z'$ that originally
generated the optical flow must be a rational function of the form $\frac{P_1}{Q_2}$ to
permit ambiguous interpretation of its rigid motion. This is contrary to the
hypothesis of the theorem. This proves the theorem.
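
The construction used in the proof can be exercised numerically. Given two different motions, the sketch below evaluates the depth $Z' = P_1/Q_2$ at an image point, solves (3.23.1) for the companion depth $Z$, and confirms that both surface and motion pairs then predict the same flow vector, including the $v$ component that was not used to fix $Z$. It assumes $f = 1$ and takes the sign convention of $P_1$ that follows from eliminating $Z$ between (3.23.1) and (3.23.2). This is a purely algebraic check with hypothetical values: the depths it produces are not constrained to be positive.

```python
def rot_terms(x, y, a, b, g):
    """Rotational part of eq. (3.18) with f = 1 (linear in a, b, g)."""
    return b - g * y - a * x * y + b * x * x, -a + g * x - a * y * y + b * x * y


def flow(x, y, Z, T, Om):
    """Optical flow of eqs. (3.21)/(3.22) with f = 1."""
    U, V, W = T
    ru, rv = rot_terms(x, y, *Om)
    return (U - x * W) / Z + ru, (V - y * W) / Z + rv


def ambiguous_depths(x, y, T, Om, T1, Om1):
    """Depths (Z', Z) at (x, y) for which motions (T1, Om1) and (T, Om) agree.

    (T1, Om1) plays the role of the primed motion in the text.
    """
    U, V, W = T
    U1, V1, W1 = T1
    da, db, dg = (Om[i] - Om1[i] for i in range(3))
    P1 = (U1 * V - U * V1) + x * (V1 * W - V * W1) + y * (U * W1 - U1 * W)
    Q2 = ((db * V + da * U) - x * (da * W + dg * U) - y * (db * W + dg * V)
          - x * y * (da * V + db * U) + x * x * (db * V + dg * W)
          + y * y * (da * U + dg * W))
    Z1 = P1 / Q2                                   # eqs. (3.24), (3.25)
    r1, _ = rot_terms(x, y, da, db, dg)
    Z = (U - x * W) / ((U1 - x * W1) / Z1 - r1)    # from eq. (3.23.1)
    return Z1, Z


T, Om = (1.0, 0.5, 0.2), (0.1, -0.2, 0.3)
T1, Om1 = (0.3, -0.4, 1.0), (0.0, 0.1, -0.1)
max_gap = 0.0
for x, y in [(0.1, 0.2), (-0.3, 0.4)]:
    Z1, Z = ambiguous_depths(x, y, T, Om, T1, Om1)
    fa, fb = flow(x, y, Z1, T1, Om1), flow(x, y, Z, T, Om)
    max_gap = max(max_gap, abs(fa[0] - fb[0]), abs(fa[1] - fb[1]))
```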


Corollary I: When the motion of a surface is purely rotational, the motion is
uniquely determined by the optical flow field.

Proof: In equation (3.23) make the substitutions $U' = V' = W' = 0$ to obtain:

$$\frac{U - xW}{Z} + f\Delta\beta - y\Delta\gamma - \frac{xy}{f}\Delta\alpha + \frac{x^2}{f}\Delta\beta = 0$$
$$\frac{V - yW}{Z} - f\Delta\alpha + x\Delta\gamma - \frac{y^2}{f}\Delta\alpha + \frac{xy}{f}\Delta\beta = 0$$

Now, eliminating $Z$ from the above equations and setting the focal length $f$ to
unity, we obtain:

$$(\Delta\beta V + \Delta\alpha U) - x(\Delta\alpha W + \Delta\gamma U) - y(\Delta\beta W + \Delta\gamma V) - xy(\Delta\alpha V + \Delta\beta U) + x^2(\Delta\beta V + \Delta\gamma W) + y^2(\Delta\alpha U + \Delta\gamma W) = 0$$

Since this equation must hold for all values of $x$ and $y$, the coefficients of
unity, $x$, $y$, $xy$, $x^2$ and $y^2$ must all vanish, giving rise to the following six
equations:

$$\Delta\alpha U + \Delta\beta V = 0$$
$$\Delta\alpha V + \Delta\beta U = 0$$
$$\Delta\beta V + \Delta\gamma W = 0$$
$$\Delta\alpha U + \Delta\gamma W = 0$$
$$\Delta\alpha W + \Delta\gamma U = 0$$
$$\Delta\beta W + \Delta\gamma V = 0$$

The above equations imply either $U = V = W = 0$ or $\Delta\alpha = \Delta\beta = \Delta\gamma = 0$.
Both these conditions mean that the optical flow field due to a pure
rotational motion has a unique interpretation. This proves the corollary.

Corollary II: It is possible for a flow field generated by pure translational motion
to be identical to one generated by another motion involving both translation and
rotation. In other words, convergence of the flow vectors directly onto a point on
the image plane does not imply purely translational motion.

The truth of the above corollary will be demonstrated by a numerical
example. Consider two flow fields generated by different surfaces
undergoing different motions:

In the first case the motion is due to a planar surface given by the equation:

$$Z = \frac{1}{2}X - \frac{5}{6}Y + 1$$

The motion is rigid and is specified by

$$\left( x_0 = -\frac{7}{2},\; y_0 = \frac{35}{6},\; \alpha = 5,\; \beta = 3,\; \gamma = 0 \right)$$

Assume the translation in depth to be unity. Then, from equation (3.18)
with $f = 1$, we have

$$u = \left( -\frac{7}{2} - x \right)\left( 1 - \frac{1}{2}x + \frac{5}{6}y \right) + 3 - 5xy + 3x^2 = -\frac{1}{2} + \frac{3}{4}x - \frac{35}{12}y - \frac{35}{6}xy + \frac{7}{2}x^2$$

$$v = \left( \frac{35}{6} - y \right)\left( 1 - \frac{1}{2}x + \frac{5}{6}y \right) - 5 + 3xy - 5y^2 = \frac{5}{6} - \frac{35}{12}x + \frac{139}{36}y + \frac{7}{2}xy - \frac{35}{6}y^2$$

In the second case the motion is due to the planar surface given by the
equation:

$$Z = \frac{7}{2}X - \frac{35}{6}Y + 1$$

and the motion is specified by the parameter vector

$$\left( x_0 = -\frac{1}{2},\; y_0 = \frac{5}{6},\; \alpha = 0,\; \beta = 0,\; \gamma = 0 \right)$$

The optical flow fields in both examples are identical.
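
The numerical example can be verified with exact rational arithmetic. The sketch below assumes the following reading of the example: a first plane $Z = \frac{1}{2}X - \frac{5}{6}Y + 1$ with FOE $(-\frac{7}{2}, \frac{35}{6})$ and rotation $(5, 3, 0)$, and a second plane $Z = \frac{7}{2}X - \frac{35}{6}Y + 1$ with FOE $(-\frac{1}{2}, \frac{5}{6})$ and no rotation, both with unit translation in depth and $f = 1$. The function names are illustrative.

```python
from fractions import Fraction as F


def flow(x, y, inv_depth, foe, rot):
    """Eq. (3.18) with f = 1 and W = 1; inv_depth is 1/Z at image point (x, y)."""
    x0, y0 = foe
    a, b, g = rot
    u = (x0 - x) * inv_depth + b - g * y - a * x * y + b * x * x
    v = (y0 - y) * inv_depth - a + g * x - a * y * y + b * x * y
    return u, v


def inv_depth_on_plane(x, y, A, B):
    """1/Z for the plane Z = A*X + B*Y + 1, using X = x*Z and Y = y*Z."""
    return 1 - A * x - B * y


def first_case(x, y):
    """Translation plus rotation over the first plane."""
    return flow(x, y, inv_depth_on_plane(x, y, F(1, 2), F(-5, 6)),
                (F(-7, 2), F(35, 6)), (5, 3, 0))


def second_case(x, y):
    """Pure translation over the second plane."""
    return flow(x, y, inv_depth_on_plane(x, y, F(7, 2), F(-35, 6)),
                (F(-1, 2), F(5, 6)), (0, 0, 0))


grid = [(F(i, 4), F(j, 4)) for i in range(-4, 5) for j in range(-4, 5)]
identical = all(first_case(x, y) == second_case(x, y) for x, y in grid)
```

Using `Fraction` avoids any tolerance: the two flow fields are quadratic polynomials in $(x, y)$, so exact agreement on a grid of this size is enough to establish that they coincide everywhere.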

The question of multiple interpretations of the same flow field has
received some attention in the literature. The foregoing example illustrates
the fact that the motion of planes is potentially open to more than one
interpretation. It is known (see [81-83, 89]) that the motion of planes has
dual interpretations. Uniqueness of interpretation for planes requires three
views of four points, or two views of seven points which uniquely define
two planes neither of which passes through the origin. In another study Fang
and Huang [28] showed that nine points not lying on a second order surface
passing through the origin can be used to determine the motion parameters
uniquely. Another significant theoretical result is due to Longuet-Higgins
[53], and Tsai and Huang [84], where eight points are used to solve for the
motion parameters from a set of linear equations. The important questions
as yet unanswered are: under what conditions is the optical flow field
inherently ambiguous, and what is the degree of ambiguity possible in
optical flow fields? The following analysis answers these questions.


Theorem II. Under the assumption of rigidity, an optical flow field is amenable
to at most three interpretations.

Proof: Theorem I shows that the optical flow field is enough to determine
the rigid motion parameters uniquely for most surfaces. It was seen,
however, that in the case of certain rational functions there is potential
ambiguity in the interpretation of motion. These are the rational functions
of the form

$$Z = \frac{ax + by + c}{d + ex + fy + gxy + hx^2 + iy^2} \tag{3.26}$$

Planar surfaces belong to the above class of surfaces. It has been
mentioned previously that planar surfaces can have at most two
interpretations. When a surface is non planar, to have multiple
interpretations of its motion it must be of the type given by equation (3.26),
with the added property that there is no common factor between the
numerator and the denominator.

Let such a surface be undergoing rigid motion $(U', V', W', \alpha, \beta, \gamma)$. Let there be
another motion $(U, V, W, \alpha + \Delta\alpha, \beta + \Delta\beta, \gamma + \Delta\gamma)$ that produces an identical flow
field. Then comparing with equation (3.24) we have

$$V'W - VW' = ak$$
$$UW' - U'W = bk$$
$$U'V - UV' = ck \tag{3.27}$$

where $k$ is some constant factor. Since, by definition of the class of rational
functions (3.26), at least one of $a$, $b$ and $c$ must be non zero, $k \neq 0$. This is
because if $k$ is zero then from the above set of three equations we get the
result that the translations $(U', V', W')$ and $(U, V, W)$ are identical up to a
scale factor. Hence by Lemma I of Appendix I, the motion is not ambiguous.

Multiplying the first equation by $U'$, the second by $V'$, and the third by
$W'$ and adding the three equations we have

$$(aU' + bV' + cW')\,k = 0$$

This means that the motion can only be ambiguous when

$$aU' + bV' + cW' = 0 \tag{3.28}$$

Similarly it can be shown that

$$aU + bV + cW = 0 \tag{3.29}$$

Again comparing the denominator of the rational function in equation
(3.26) with equation (3.25), and combining the constant $k$ with the
translation parameter $(U, V, W)$:

$$\Delta\beta V + \Delta\alpha U = d \tag{3.30}$$
$$\Delta\alpha W + \Delta\gamma U = -e \tag{3.31}$$
$$\Delta\beta W + \Delta\gamma V = -f \tag{3.32}$$
$$\Delta\alpha V + \Delta\beta U = -g \tag{3.33}$$
$$\Delta\beta V + \Delta\gamma W = h \tag{3.34}$$
$$\Delta\alpha U + \Delta\gamma W = i \tag{3.35}$$

From equations (3.30), (3.34) and (3.35) we get:

$$2\Delta\alpha U = q \tag{3.36}$$
$$2\Delta\beta V = r \tag{3.37}$$
$$2\Delta\gamma W = s \tag{3.38}$$

where $q = d + i - h$, $r = d - i + h$, $s = -d + i + h$. Substituting from the
above equations into equations (3.33), (3.32) and (3.31):

$$qV^2 + rU^2 + 2gUV = 0 \tag{3.39}$$
$$rW^2 + sV^2 + 2fVW = 0 \tag{3.40}$$
$$qW^2 + sU^2 + 2eUW = 0 \tag{3.41}$$

Equations (3.39), (3.40) and (3.41), together with equation (3.28), can admit
no more than two solutions. This is because at least one of $(q, r, s, e, f, g)$
must be nonzero. Therefore, since there can be at the most two spurious
solutions (recall that the veridical solution corresponds to $k = 0$), the
implication is that:

When the optical flow field has more than one interpretation, the number of
globally consistent solutions for the motion parameters can be at most three.

This completes the proof of the theorem.

It will be shown that there exist surfaces whose rigid motion induces
optical flow that is compatible with three distinct interpretations. This fact
explains why Longuet-Higgins and Prazdny [52] noted that, from local
optical flow constraints and their derivatives, three interpretations of the
motion are possible, since the constraint equations were cubic.

An example of a 2D motion field with three distinct rigid motion interpretations:
The equation of the moving surface is

$$Z = \frac{1}{g\,xy}$$

the motion parameters are $(U, V, 0, \alpha, \beta, \gamma)$, and (taking $f = 1$) the expression
for optical flow is

$$u = \frac{U}{Z} - \alpha xy + \beta(x^2 + 1) - \gamma y, \qquad v = \frac{V}{Z} - \alpha(y^2 + 1) + \beta xy + \gamma x$$

therefore

$$u = U g\,xy - \alpha xy + \beta(x^2 + 1) - \gamma y$$
$$v = V g\,xy - \alpha(y^2 + 1) + \beta xy + \gamma x$$

Alternative interpretation I:

$$Z_1 = \frac{U}{g\,[U xy - V(x^2 + 1)]}$$

where the motion parameters are $(U, 0, 0, \alpha, \beta + gV, \gamma)$. The optical flow field is
given by

$$u_1 = U\,\frac{g}{U}[U xy - V(x^2 + 1)] - \alpha xy + (\beta + gV)(x^2 + 1) - \gamma y$$
$$v_1 = -\alpha(y^2 + 1) + (\beta + gV)xy + \gamma x$$

Alternative interpretation II:

$$Z_2 = \frac{V}{g\,[V xy - U(y^2 + 1)]}$$

The motion parameters are $(0, V, 0, \alpha - gU, \beta, \gamma)$. The optical flow field is

$$u_2 = -(\alpha - gU)xy + \beta(x^2 + 1) - \gamma y$$
$$v_2 = V\,\frac{g}{V}[V xy - U(y^2 + 1)] - (\alpha - gU)(y^2 + 1) + \beta xy + \gamma x$$

It is easily verified that $u = u_1 = u_2$ and $v = v_1 = v_2$.

Theorem I states that in certain cases the optical flow field may not
determine the motion parameters uniquely. The next theorem shows how
unambiguous determination of the motion parameters can be achieved from
optical flow data.
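
Before moving on, the example of three interpretations given earlier can be checked mechanically. The sketch below assumes the reading in which the original surface has inverse depth $g\,xy$ and motion $(U, V, 0, \alpha, \beta, \gamma)$, the first alternative has translation $(U, 0, 0)$ with $\beta$ replaced by $\beta + gV$, and the second has translation $(0, V, 0)$ with $\alpha$ replaced by $\alpha - gU$, all with $f = 1$. The parameter values are hypothetical, and exact rational arithmetic is used so that no tolerance is needed.

```python
from fractions import Fraction as F


def flow(x, y, inv_Z, T, Omega):
    """Eq. (3.18) with f = 1; inv_Z is the inverse depth 1/Z at (x, y)."""
    U, V, W = T
    a, b, gam = Omega
    u = (U - x * W) * inv_Z + b - gam * y - a * x * y + b * x * x
    v = (V - y * W) * inv_Z - a + gam * x - a * y * y + b * x * y
    return u, v


def interpretations(x, y, U, V, a, b, gam, g):
    """Flows predicted by the original motion and by the two alternatives."""
    original = flow(x, y, g * x * y, (U, V, 0), (a, b, gam))
    alt_1 = flow(x, y, g * (U * x * y - V * (x * x + 1)) / U,
                 (U, 0, 0), (a, b + g * V, gam))
    alt_2 = flow(x, y, g * (V * x * y - U * (y * y + 1)) / V,
                 (0, V, 0), (a - g * U, b, gam))
    return original, alt_1, alt_2


params = dict(U=F(2), V=F(3), a=F(1, 2), b=F(-1, 3), gam=F(1, 4), g=F(5))
grid = [(F(i, 3), F(j, 3)) for i in range(-3, 4) for j in range(-3, 4)]
all_agree = all(len(set(interpretations(x, y, **params))) == 1 for x, y in grid)
```

The three surfaces differ, yet `all_agree` confirms that the three flow fields coincide on the whole grid, which is what makes the single-frame flow ambiguous.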

Theorem III: The ambiguity of the optical flow field disappears when the
observation period extends over more than two time instants, assuming that the
motion in three space is steady.

Proof: The term steady motion indicates that the direction of translation is
fixed in space with respect to any inertial frame of reference. In other
words, the observer's trajectory is a straight line.

The proof of the theorem follows straightforwardly from equations (3.28)
and (3.29). Those equations tell us that ambiguity can only occur when the
direction of translation lies on the plane tangent to the observed surface at
the origin. Since this condition must necessarily be maintained in order to
preserve ambiguity, we can state:

To maintain ambiguity, the spatial trajectory of the observer's nodal point
(i.e. origin of the frame of observation) must lie on the observed surface.

Since the observer's trajectory is a straight line, the above condition implies
one of the following:

(a) The surface is planar.

(b) It is developable, i.e. one of the principal curvatures vanishes at all
points (e.g. a cylindrical surface).

In the first case it can be shown that the ambiguity cannot be sustained [83].
In the second case, the direction of translation must be along the principal
axis corresponding to the vanishing principal curvature. This means that all
the interpretations must have their translational velocities in the same
direction. Thus their rotational velocities must be identical (see Appendix
A). Hence the motion will not have ambiguity. This completes the proof of
the theorem.

Another way of resolving the ambiguity in the optical flow is by using
shape information. There is a strong relationship between the parameters of
motion, the optical flow field and the structure of a moving surface. The
structure of the surface is defined by depth ratios between any pair of given
points (see Appendix B). The following propositions make this concept
clear.

Proposition I. When the parameters (i.e. $x_0, y_0, \alpha, \beta, \gamma$) describing the motion of a
rigid surface are known then the structure of the surface is uniquely determined
from the optical flow field.

Proof: The proof is evident from equation (3.18). Note that we can obtain
the depth function up to a constant dilation factor $W$. In other words the
ratio of depths at any two image points can be computed.

Proposition II. When the structure of a surface is known then the parameters
describing its rigid motion are uniquely obtained from the optical flow generated
by the motion.

Proof: See Appendix II.

Even the partial specification of shape can lead to a correct perception
of rigid motion. An illustration of the fact that shape information can
disambiguate between alternative motion interpretations comes from the
next theorem.

Theorem  IV:  Ihe  motion  of  a  planar  surface  whose  direction  of  translation  does 
not  lie  in  the  plane  of  its  surface  normal  and  the  line  of  sight,  can  be  interpreted 
correctly  from  the  optical  flow  generated,  when  the  tilt  of  the  plane  is  known. 

Proof:  Let  the  equation  of  the  planar  surface  be 

d 

l  -  px  -  qy 

where  (p,q)  is  the  orientation  of  the  depth  plane  and  ’d’  is  the  distance 
from  the  origin  along  the  z  axis  (e.g.  line  of  sight ).  Substituting  the  above 
into  equation  (3.18)  and  observing  that  we  can  ignore  multiplication  of  the 
translational  parameters  by  a  constant  (such  as  d  )  since  we  can  compute  the 
former  up  to  a  scale  factor  anyway,  we  have: 

«  =lx  -  l2x  -  l2y  +  ltxy  +  /jj2 
v  =l9-  l7x  -  l,y  -lsxy  +  lAy2 
where  the  unknowns  {  a,  }  are  given  by 


(3.42) 
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Up  +  W  =/j 

(3.43.2) 

Uq  +  7  =fs 

(3.43.3) 

Wg  -  or  =lt 

(3-.43.4) 

Wp+0  =lt 

(3.43.5) 

V-  a=t. 

(3.43.6) 

7  -  Vp=l7 

(3.43.7) 

Vq  +  W  = l , 

(3.43.8) 

If we can estimate the synthetic parameters $l_i$ using measurements of
optical flow at a minimum of four points in the image and, in
addition, can measure the tilt of the depth plane, i.e.

$$\tau = \frac{p}{q} \tag{3.44}$$

then from (3.43.7), (3.43.8) and (3.44) we have:

$$\gamma + \tau W = l_7 + \tau l_8 \tag{3.45.1}$$

From (3.43.2), (3.43.3) and (3.44) we have:

$$\tau\gamma - W = \tau l_3 - l_2 \tag{3.45.2}$$

Therefore, since $\tau^2 + 1 \neq 0$, we have:

$$\gamma = \frac{l_7 + \tau(l_8 - l_2) + \tau^2 l_3}{\tau^2 + 1} \tag{3.45.3.1}$$

$$W = \frac{l_2 + \tau(l_7 - l_3) + \tau^2 l_8}{\tau^2 + 1} \tag{3.45.3.2}$$

Now if $W \neq l_8$ (i.e. $q \neq 0$) we have from (3.43.8) and (3.43.3):

$$\frac{Uq}{Vq} = \frac{U}{V} = \frac{l_3 - \gamma}{l_8 - W} = k \tag{3.45.4}$$

otherwise if $\gamma \neq l_7$ (i.e. $p \neq 0$) we have from equations (3.43.7)
and (3.43.2):

$$\frac{Up}{Vp} = \frac{U}{V} = \frac{l_2 - W}{\gamma - l_7} = k \tag{3.45.4}$$

(if both $p$ and $q$ are zero then the parameters are easily solved for).

Now from (3.45.4), (3.43.6) and (3.43.1) we have:

$$k\alpha + \beta = l_1 - k l_6 \tag{3.45.5}$$

Also from (3.43.5) and (3.43.4) we have:

$$\tau\alpha + \beta = l_5 - \tau l_4 \tag{3.45.6}$$

Therefore, since $\tau \neq k$ from the assumption made in the statement of the
theorem, equations (3.45.5) and (3.45.6) are independent, and we
have:

$$\alpha = \frac{(l_1 - k l_6) - (l_5 - \tau l_4)}{k - \tau} \tag{3.45.7.1}$$

$$\beta = \frac{k(l_5 - \tau l_4) - \tau(l_1 - k l_6)}{k - \tau} \tag{3.45.7.2}$$

Now U and V can be determined from equations (3.43.6) and (3.43.1).
Thus we have determined the motion parameters uniquely from the optical
flow and tilt information.
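
As a numerical check on the closed-form solution of Theorem IV, the following Python sketch builds the eight synthetic parameters of (3.43) from a known motion (with d absorbed into the translation) and then recovers the motion from the l_i and the tilt using (3.45.1) to (3.45.7). The function names are illustrative and do not come from the original text, and the degenerate case p = q = 0 is not handled.

```python
def plane_flow_coefficients(U, V, W, alpha, beta, gamma, p, q):
    """Synthetic parameters l1..l8 of (3.43), with d scaled to 1."""
    return [U + beta, U * p + W, U * q + gamma, W * q - alpha,
            W * p + beta, V - alpha, gamma - V * p, V * q + W]


def motion_from_tilt(l, tau, eps=1e-12):
    """Recover (U, V, W, alpha, beta, gamma) from l1..l8 and tau = p/q."""
    l1, l2, l3, l4, l5, l6, l7, l8 = l
    # (3.45.3.1) and (3.45.3.2)
    gamma = (l7 + tau * (l8 - l2) + tau ** 2 * l3) / (tau ** 2 + 1)
    W = (l2 + tau * (l7 - l3) + tau ** 2 * l8) / (tau ** 2 + 1)
    # (3.45.4): k = U/V, taken from whichever ratio is well defined
    if abs(l8 - W) > eps:
        k = (l3 - gamma) / (l8 - W)
    else:
        k = (l2 - W) / (gamma - l7)
    # (3.45.7.1) and (3.45.7.2); requires k != tau
    alpha = ((l1 - k * l6) - (l5 - tau * l4)) / (k - tau)
    beta = (k * (l5 - tau * l4) - tau * (l1 - k * l6)) / (k - tau)
    # (3.43.6) and (3.43.1)
    V = l6 + alpha
    U = l1 - beta
    return U, V, W, alpha, beta, gamma
```

The recovery fails by division by zero when k equals the tilt, which is exactly the configuration excluded in the statement of the theorem.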

At this point it may be mentioned in passing that it is possible to obtain
the motion parameters uniquely from the optical flow generated by two
planes moving together rigidly. In this case the optical flow is locally second
order. If the eight synthetic parameters are now measured at two different
regions of the flow field then

$$\begin{aligned}
U\,\Delta\tfrac{1}{d} &= \Delta l_1 \\
U\,\Delta\tfrac{p}{d} + W\,\Delta\tfrac{1}{d} &= \Delta l_2 \\
U\,\Delta\tfrac{q}{d} &= \Delta l_3 \\
W\,\Delta\tfrac{q}{d} &= \Delta l_4 \\
W\,\Delta\tfrac{p}{d} &= \Delta l_5 \\
V\,\Delta\tfrac{1}{d} &= \Delta l_6 \\
-V\,\Delta\tfrac{p}{d} &= \Delta l_7 \\
V\,\Delta\tfrac{q}{d} + W\,\Delta\tfrac{1}{d} &= \Delta l_8
\end{aligned} \tag{3.46}$$

where the two planes involved in the motion are given by
$z = \frac{d}{px + qy + 1}$ and $z = \frac{d'}{p'x + q'y + 1}$. The $\Delta$
operator in front of any quantity denotes the difference of the
corresponding parameters for the two planes, e.g.
$\Delta\frac{p}{d} = \frac{p}{d} - \frac{p'}{d'}$.

The above equations imply that when at least one of $\Delta\frac{1}{d}$,
$\Delta\frac{p}{d}$ or $\Delta\frac{q}{d}$ is nonzero the translational
parameters are uniquely determined. Hence in such a case the rigid motion
parameters are determined uniquely from the optical flow field (see
Appendix A). Therefore:

When two planes, neither of which pass through the origin, move rigidly
together, their motion is uniquely determinable from the optical flow field.

3.6.2. Summary of the perspective projection case

The analysis presented here leads to considerable insight into the 3D
motion interpretation problem. Previous results (e.g. [28, 84]) by Huang
and his colleagues presented sufficient conditions for uniqueness of three
dimensional motion interpretation, since they were concerned with specific
algorithms. The work reported here deals with necessary conditions for
unique interpretation of 3D motion from the optical flow field.

While equation (3.26) does describe second order surfaces containing the
nodal point of the camera, it is certainly true that not all such surfaces
admit ambiguous interpretations of their 3D motions. Multiple
interpretations require, in addition, that the constraints given
by (3.29), (3.39), (3.40) and (3.41) all be satisfied.

Thus consider an algorithm, such as Prazdny's [67], where nonlinear
(and independent) flow constraints at five retinal locations are used to
obtain a 3D motion interpretation. It is now possible to answer the question
as to whether the solution obtained is the only one possible. Since a
set of motion parameters is now known, the relative depth can be obtained
from equation (3.22) at the five retinal locations. The latter, when
substituted into equation (3.26), generates five linear equations in the
surface parameters a, b, c, d, e, f. These together with the four
constraints (3.29), (3.39), (3.40) and (3.41) constitute nine linear
homogeneous equations in the nine surface parameters. Therefore
uniqueness of interpretation is guaranteed if the determinant of the above
system is nonzero, which in turn implies that all the surface parameters must
be zero. This makes it impossible to construct any other interpretation from
measurements at the five retinal locations, guaranteeing that the solution
obtained is the only one possible.


3.6.  Summary  of  motion  constraint  results 

Uniqueness proofs of the type derived by Tsai and Huang and Fang and
Huang do not allow us to visualize the situations when the optical flow field
is intrinsically ambiguous, admitting more than one interpretation. The
analysis of the optical flow field to determine cases of ambiguity was a major
focus of this chapter. We saw that three temporally contiguous image
frames contain enough information to uniquely recover 3-D motion and
structure under perspective projection. Since the optical flow field (two
temporally proximal frames) is, in general, ambiguous, two frames can
recover structure when the moving surface satisfies the conditions of
Theorem I.

The image formation geometry used in the analysis involved the
perspective projection. We also briefly examined an approximation of the
above model called orthographic or parallel projection. The attendant
simplicity in the motion constraint equations can be used to considerable
advantage in the preliminary analysis of the motion perception problem.

The following results were derived:

1. The component of rotation about the line of sight, the ratio of the
other two components of rotational velocity, and the tilt function are
uniquely computable from a single optical flow field, for a rigid non
planar surface.

2. When the surface normals for a rigid surface are known then the
motion parameters can be computed uniquely.

The Perspective Projection model (see figure 3.2) is a more accurate model of
image formation by eye or camera. For this model it is proved that:

1. The optical flow field, under the assumption of rigidity, can have at
most three interpretations.

2. The rigid motion of any surface whose depth relative to the sensor
cannot be expressed by the rational function $P_1/Q_2$, where
$P_1$ and $Q_2$ are rational functions of the first and second orders
respectively, is uniquely computable from the information in the optical
flow field.

3. Two optical flow fields, obtained at different time instants, determine
the motion parameters uniquely.

4. The motion parameters are uniquely determined from the optical flow
field when the corresponding motion involves rotation only.

5. The optical flow due to planar surfaces is generally ambiguous.
However this ambiguity can be resolved either when the flow field is
due to more than one plane moving together rigidly, or in the case of a
single plane, if its tilt is known.


Chapter  Four 


Algorithms  for  Rigid  Motion  Perception 


4.1.  Introduction 

In  this  chapter  the  applicability  of  the  Hough  Transform  technique  to 
motion  parameter  estimation  is  examined  experimentally. 

The  main  difficulty  in  computing  the  3D  Rigid  Motion  parameters  is 
that  the  equation  constraining  the  image  motion  to  the  3D  motion  is 
nonlinear.  Another  complication  arises  from  the  high  dimensionality  of  the 
parameter  space.  If  it  were  possible  to  separate  the  component  of  the  image 
displacement  due  to  translation  from  that  due  to  rotation  we  could  have 
efficient  algorithms  for  the  computation  of  the  three  dimensional  motion 
parameters. 

The brute force Hough algorithm [7] is seen to have limitations that
stem from the above difficulties. The next section of this chapter outlines
the various computational strategies that were adopted to overcome the
above problems. After this, algorithms employing these strategies and their
experimentally determined performance are presented. The remaining
portion of this introductory section is a brief discussion of other algorithms
that have been proposed in the literature.

An algorithm due to Ullman uses a simplified situation where the
rotation axis is assumed to be along the z axis [85]. The constraint he
obtained was an equation of the fourth degree in the sine of the rotation
angle. Roach and Aggarwal derived a set of nonlinear constraint equations
in eighteen parameters to characterize rigid body motion [71]. In recent
years, most of the work on motion interpretation in the literature attempts
either least square error minimization or iterative search techniques to
compute the set of motion parameters that best describe the image motion
data. The constraint equation used is of some form similar to the one derived
in chapter three (also equation 4.7). Bruss and Horn [19] compute the
parameter set that minimizes the square of the error between the measured
optical flow and that computed from the parameter constraint. In general
such a technique will give rise to a system of nonlinear equations from
which the parameters must be computed using some suitable iteration
scheme. Longuet-Higgins and Prazdny [52] mention the possibility of using
motion parallax to simplify the computation of the global motion
parameters. Lawton and Rieger [70] use a similar idea to factor out the
rotational component of the optical flow at depth discontinuities or regions
where the depth gradient is large. This method is not reliable since it hinges
upon the ability to compute flow vectors reasonably accurately at
discontinuities, and almost all algorithms to date for computing optical
flow face problems at regions where the field is sharply discontinuous.

Another method is to linearize the constraint equation by writing
equation 4.7 as a linear equation in eight parameters. Obviously these eight
parameters are each functions of the values of the five actual parameters.
This implies that linear least square methods are not applicable here, since
the eight synthetic parameters are not independent of one another. A similar
method is used by Tsai and Huang [84] but they found that the computation
is very sensitive to errors. Algorithm V attempts to alleviate the problem of
high dimensionality of the above scheme by using spatio-temporal
derivatives of the optical flow field.

In the case of General Motion, where one or more objects move with
respect to the observer, the situation is complicated by the fact that we have
to determine several sets of parameters, corresponding to the several bodies
in motion. However, the image motion measurement technique of chapter
two has been found to be quite good under such circumstances. This fact
enables us to assume that the algorithm for motion parameter estimation
can deal, without loss of generality, with the motion of a single body.
Motion segmentation has also been studied in restricted domains by
Fennema & Thompson [33]. The more tricky question concerning the
difference between Egomotion and General Motion has to do with the choice
of the body frame of reference. In the case of egomotion the camera frame
and the natural body frame can be thought to coincide. This means that the
notion of steady motion (i.e. parameters relatively unvarying within any
small period of observation) is intuitive, in the sense that it implies steady
motion of the observer in space. On the other hand for the motion of an
object the usual convention is to choose the body frame of reference to
coincide with the camera frame. In this case the steady motion of the body
in space need not imply steadiness of the observed motion parameter
values.

Recently a way of determining motion parameters from three
dimensional flow has been suggested [8]. This method is amenable to
adaptation  to  the  general  motion  case.  It  is  not  clear  as  to  how  difficult  it  is 
to  compute  the  three  dimensional  flow  in  this  case.  However,  it  can  be 
shown  that  in  case  a  depth  map  can  be  obtained  (by  some  stereo  matching 
technique),  the  three  dimensional  flow  map  can  be  calculated. 

Computer  algorithms  for  determining  the  parameters  of  rigid  motion 
are  discussed  in  the  light  of  the  various  constraints  developed  in  the 
previous  chapter.  The  treatment  will  consider  both  orthographic  and 
perspective  projections.  Some  of  the  algorithms  are  described  in  detail,  while 
others  are  outlined,  particularly  when  they  bear  any  similarity  to  one  already 
discussed. In the algorithms proposed here, the Hough Transform technique
[7] is used to compute the desired global parameters from sets of constraint
equations obtained at different image (or retinal) locations. It should be
noted that least square error minimization techniques are also applicable in
most cases.

Recall that the notation for optical flow is $(u,v)$, while the translation
parameters are denoted by $(U,V,W)$ or $(x_0 = \frac{U}{W},\ y_0 = \frac{V}{W})$
and the rotational parameters by $(\alpha,\beta,\gamma)$.

4.2.  Using  the  Hough  Transform  for  Motion  Parameter  estimation 

The concept of the Hough transform is very simple. It is closely related
to the idea of clustering, introduced in chapter two. Consider an example
problem where we are required to estimate the parameters of a straight line
in two space from local measurements of small edge segments. The form of
the line equation we will use is

$$x\cos\theta + y\sin\theta = \rho$$

and hence the parameters to be estimated are $(\rho,\theta)$. The set of
measurements is given by

$$M = \{(x_i, y_i, \theta_i) \mid \text{there is an edge at } (x_i, y_i) \text{ with orientation } \theta_i\}$$

Using the elements of M we obtain a distribution $H(\rho,\theta)$ which denotes the
count of the number of times each of the $(\rho,\theta)$ pairs satisfied the line
constraint equation for all the data values. This distribution is called the
Hough transform of the data set M. The parameter estimate $(\rho^*,\theta^*)$ is then
given by the mode of the distribution $H()$. The situation is depicted in
figure 4.1.


The Hough Transform is defined in the $(\rho,\theta)$ parameter space. An important
aspect of the method is the necessity of quantizing (or discretizing) the
parameter space in order to implement the transform process by computer
(or by a hardwired connectionist network [31]). The degree of quantization
is, in most cases, a critical control variable. The quantization can be
visualized by imagining the parameter space to be covered by a set of cells
that collect evidence or votes from the data values in order to determine
whether the desired parameter set lies in the space spanned by the cell.


Figure  4.1  Parameter  estimation  by  Hough  Transform 
An alternative formulation arises when we are unable to measure the
orientation $\theta$ of the edge elements. In this case

$$M = \{(x_i, y_i) \mid \text{there is an edge at } (x_i, y_i)\}$$

Therefore, every $(x_i, y_i)$ determines a constraint surface in the parameter
space:

$$x_i\cos\theta + y_i\sin\theta = \rho$$

Thus the transform is obtained by voting for every cell in the transform
space that "satisfies" the constraint for a given data element. Again the
estimated parameter set $(\rho^*,\theta^*)$ is obtained by taking the mode of the
resulting distribution $H(\rho,\theta)$.
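
The orientation-free formulation can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the experiments: the function name, the angular resolution `n_theta` and the cell width `rho_step` are assumptions, and the accumulator is a sparse counter rather than a dense array.

```python
import math
from collections import Counter


def hough_lines(points, n_theta=180, rho_step=1.0):
    """Vote in quantized (rho, theta) space; return the mode and its votes."""
    votes = Counter()
    for x, y in points:
        # Each point traces the curve rho = x cos(theta) + y sin(theta).
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            votes[(round(rho / rho_step), t)] += 1
    (rho_idx, t), count = votes.most_common(1)[0]
    return rho_idx * rho_step, math.pi * t / n_theta, count
```

With coarse cells several neighbouring cells can collect the same number of votes, which is the multimodality problem discussed later in this section.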

The motion parameters that are to be estimated are:

$$(x_0,\ y_0,\ \alpha,\ \beta,\ \gamma)$$

The measured data is the optical flow field $[u(x,y), v(x,y)]$. In order to use
the Hough transform method to tackle the motion perception problem, the
following issues have to be addressed:

(i) What does it mean for the data to satisfy a constraint? This question is
important since we have to contend with nonlinearity, discretization
and noise. Thus the data may never exactly satisfy the constraint.

(ii) At what coarseness level should the parameter space be quantized?

(iii) How does nonlinearity affect the first two issues?

In order to represent the parameter space one has to be able to
determine the bounds of the plausible parameter values. This does not seem
an unreasonable demand; however, a more critical factor is the quantization
of the space. In the case of nonlinear constraints small discretization errors
may cause large fluctuations in the constraint surface. Hence with coarse
discretization the issue of "constraint satisfaction" may be difficult to
determine. This is reflected by the results obtained from algorithm III.

A  heuristic  solution  to  this  problem  is  to  stipulate  that  constraint 
satisfaction  implies  that  the  constraint  surface  intersects  the  parameter  cell 
in  question.  This  leads  to  a  simple  scheme  to  determine  intersection,  in  the 
case  of  linear  surfaces,  whose  distance  from  the  cell  center  can  be 
determined  by  substituting  the  cell  center  parameter  values  in  the  constraint 
equation.  This  is  however  not  possible  in  the  case  of  nonlinear  constraint 
surfaces  (figure  4.2). 
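
For a linear constraint the intersection test can be written down directly. The sketch below is one possible reading of the scheme, assuming axis-aligned rectangular cells: the constraint is evaluated at the cell center and the residual is compared with the largest change the linear form can undergo inside the cell. The function name and argument layout are hypothetical.

```python
def plane_hits_cell(coeffs, rhs, center, half_widths):
    """True if the hyperplane sum(a_i * p_i) = rhs meets the box
    center +/- half_widths (an axis-aligned parameter cell)."""
    # Residual of the constraint at the cell center.
    residual = sum(a * c for a, c in zip(coeffs, center)) - rhs
    # Largest change of the linear form anywhere inside the cell.
    reach = sum(abs(a) * h for a, h in zip(coeffs, half_widths))
    return abs(residual) <= reach
```

No comparable closed-form test exists for a general nonlinear surface, which is why the text restricts the scheme to the linear case.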

The great advantage of this method is that very coarse quantization of the
parameter space is possible. The only problem with this intersection
strategy is that when the cells are large the distribution obtained may be
multimodal. This situation is depicted in figure 4.3. In this case the spurious
modes have to be eliminated by successive refinement: particular candidate
cells are split into sub cells and the voting process is repeated. This
strategy is used in algorithm V and a modified version of algorithm III.

In some of the subsequent algorithms the five dimensional parameter
space is subdivided into a translational subspace and a rotational subspace.
The first subspace is quantized in terms of rectangular cells, while in the
case of the rotational space we have used a gaussian sphere representation,
using geodesic tessellations, to span the directions in three space
corresponding to the axis of rotation.

The results reported in this chapter indicate that the Hough transform can
be a reliable and robust method for motion parameter estimation. The
problem of nonlinearity cannot be totally removed, necessitating the
knowledge of some initial estimate of the parameter solution set. Without
this it becomes difficult to label the parameter cells in the transform space,
and a large number of such cells is required.


Figure  4.3  Spurious  mode  formation  in  the  cell  intersection  scheme 


4.3.  Motion  under  Orthography 

For the sake of completeness the case of orthographic projection is
considered in this section. This is, however, a restrictive situation, which
approximates the imaging geometry when the imaged object is either very
far away, or the focal length of the lens is large. This case has been
analyzed extensively in the literature; [5, 42, 79] are some examples. The
above methods deal with local analysis of the image motion field. The
algorithms presented in this section are based on the uniqueness results of
chapter three and involve global analysis of the optical flow field.

Under orthography the translational part of the optical flow field is
constant and hence the translational parameters are not computable. Hence
motion parameters here always refer to the rotational velocity parameters
$(\alpha,\beta,\gamma)$. The relevant equations are

$$\begin{aligned}
\Delta u &= \beta\,\Delta Z - \gamma\,\Delta y \\
\Delta v &= -\alpha\,\Delta Z + \gamma\,\Delta x
\end{aligned} \tag{4.1}$$

where the $\Delta$ symbol denotes that the following quantity is a difference
obtained from measurements made at two different retinal locations. The
relations between the surface gradients and the optical flow derivatives are:

$$\frac{\partial u}{\partial x} = \beta\,\frac{\partial Z}{\partial x} \tag{4.2.1}$$

$$\frac{\partial v}{\partial x} = -\alpha\,\frac{\partial Z}{\partial x} + \gamma \tag{4.2.2}$$

$$\frac{\partial u}{\partial y} = \beta\,\frac{\partial Z}{\partial y} - \gamma \tag{4.2.3}$$

$$\frac{\partial v}{\partial y} = -\alpha\,\frac{\partial Z}{\partial y} \tag{4.2.4}$$

Algorithm I: Motion parameters from image motion and structure information.
The simplest instance is when the structure of the moving object is known.
In the discrete case the values of the relative depth function $\Delta Z(x,y)$ are
enough to compute the parameters $(\alpha,\beta,\gamma)$ uniquely from the linear
equation (4.1). For the differential case structure or shape can be
represented by the surface normals
$(\frac{\partial Z}{\partial x}, \frac{\partial Z}{\partial y})$. If the surface normals
are known everywhere, then we can integrate the surface normals to obtain
the depth up to a constant additive term. In other words $\Delta Z(x,y)$ is
computable. In this case measurement of optical flow at three non collinear
points is enough to compute the rotational parameters. However, if the
surface normals are only known at sparse locations, but the optical flow
field is locally known at these locations, then we can use equation (4.2) for
computing the rotation parameters. In this case we are relying on the fact
that the first derivatives of the flow can be reliably computed. This is
possible when, in the neighborhood of the points of interest, the optical
flow values have been measured at enough locations so as to allow analytic
reconstruction of the optical flow function. Finally note that, if the motion
parameters are known then the structure can be obtained from the image
motion for both the discrete and the differential cases. The steps in the
algorithm are:

1. Set up a three dimensional accumulator array for the rotation
parameters: $A[\alpha,\beta,\gamma] := 0$.

2. For every point in the image where optical flow and surface normals are
known, select the constraint equation (4.1) if the estimated
measurement error in the surface normal function is less than that
estimated for the optical flow function; otherwise select equation (4.2).

   For all values of $(\alpha,\beta,\gamma)$:

      If $(\alpha,\beta,\gamma)$ satisfies the constraint equation selected
         $A[\alpha,\beta,\gamma] := A[\alpha,\beta,\gamma] + 1$

3. Obtain the maximum value in the accumulator array. The
corresponding indices are the desired values for the rotation
parameters.
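
The steps above can be sketched in Python for the discrete case, using only constraint (4.1). This is a brute-force illustration under stated assumptions: the accumulator is a dictionary over a user-supplied grid of candidate values, "satisfies" is taken to mean that both residuals of (4.1) fall within a tolerance `tol`, and the helper `difference_pair` exists only to generate noise-free test data. None of these names appear in the original text.

```python
import itertools


def difference_pair(alpha, beta, gamma, dx, dy, dZ):
    """Noise-free flow differences predicted by (4.1)."""
    du = beta * dZ - gamma * dy
    dv = -alpha * dZ + gamma * dx
    return dx, dy, dZ, du, dv


def algorithm_one(pairs, grid, tol=1e-9):
    """Return the (alpha, beta, gamma) grid cell with the most votes."""
    votes = {}
    for alpha, beta, gamma in itertools.product(grid, repeat=3):
        count = 0
        for dx, dy, dZ, du, dv in pairs:
            ok_u = abs(du - (beta * dZ - gamma * dy)) <= tol
            ok_v = abs(dv - (-alpha * dZ + gamma * dx)) <= tol
            if ok_u and ok_v:
                count += 1  # this measurement votes for the cell
        votes[(alpha, beta, gamma)] = count
    return max(votes, key=votes.get)
```

With noisy data the tolerance would have to be tied to the cell size, which is the constraint satisfaction issue raised in section 4.2.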
Algorithm II: Motion parameters and structure from image motion. When the
structure is not known then, considering the differential case and
eliminating $(\frac{\partial Z}{\partial x}, \frac{\partial Z}{\partial y})$ from
equations (4.2):

$$\frac{\partial v}{\partial x} = -\frac{\alpha}{\beta}\,\frac{\partial u}{\partial x} + \gamma \tag{4.3.1}$$

$$\frac{\partial u}{\partial y} = -\frac{\beta}{\alpha}\,\frac{\partial v}{\partial y} - \gamma \tag{4.3.2}$$

Similarly, eliminating $\Delta Z$ from equation (4.1):

$$\mu\,\Delta u - \gamma\,\Delta x + \mu\gamma\,\Delta y + \Delta v = 0 \tag{4.4}$$

where $\mu = \frac{\alpha}{\beta}$. It is easy to obtain quadratic equations in
either $\gamma$ or $\frac{\alpha}{\beta}$ from the equations (4.3). This means that
in general, at every image location, from the measurement of the spatial
derivatives of the optical flow at most two sets of values of the parameters
$(\frac{\alpha}{\beta}, \gamma)$ may be obtained. However, if some global
assimilation technique, like the Hough transform (see [7]), is used,
then, as shown previously, if the moving surface is non planar, only one set
of parameters will be globally consistent. An exactly similar method, but
using differences of image displacements, can be devised for the discrete
case starting from equation (4.4).
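
The two local solutions can be made explicit. Multiplying (4.3.1) and (4.3.2) after isolating the ratio gives a quadratic in the rotation about the line of sight; the Python sketch below solves it and returns the candidate (gamma, alpha/beta) pairs. The function name and argument names are assumptions, and the case of a vanishing x-derivative of u is simply skipped.

```python
import math


def gamma_candidates(ux, uy, vx, vy):
    """Candidate (gamma, alpha/beta) pairs from the flow derivatives.

    From (4.3.1): alpha/beta = (gamma - vx) / ux.
    From (4.3.2): beta/alpha = -(uy + gamma) / vy.
    Their product is 1, so gamma**2 + (uy - vx)*gamma + (ux*vy - vx*uy) = 0.
    """
    b = uy - vx
    c = ux * vy - vx * uy
    disc = b * b - 4.0 * c
    if disc < 0 or ux == 0:
        return []
    root = math.sqrt(disc)
    gammas = {(-b + root) / 2.0, (-b - root) / 2.0}
    return sorted((g, (g - vx) / ux) for g in gammas)
```

Each image location contributes at most two candidates, and the globally consistent one is the candidate that recurs across locations.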


4.4.  Motion  under  Perspective  projection 

The relation between the optical flow and the motion parameters is given by
the equation:

$$\begin{aligned}
u &= \frac{U - xW}{Z} - \alpha xy + \beta(x^2 + 1) - \gamma y \\
v &= \frac{V - yW}{Z} - \alpha(y^2 + 1) + \beta xy + \gamma x
\end{aligned} \tag{4.5}$$

From the above we obtain, by eliminating Z:

$$\frac{u + \alpha xy - \beta(x^2 + 1) + \gamma y}{v + \alpha(y^2 + 1) - \beta xy - \gamma x} = \frac{U - xW}{V - yW} \tag{4.6}$$

Observe from the right hand side of the above equation that its value is
unchanged when the translational parameters are multiplied by some
constant. Hence we can determine the translational parameters only up to a
scale factor. If we assume that $W \neq 0$ then the previous equation can be
written as:

$$\frac{u + \alpha xy - \beta(x^2 + 1) + \gamma y}{v + \alpha(y^2 + 1) - \beta xy - \gamma x} = \frac{x_0 - x}{y_0 - y} \tag{4.7}$$

If $W = 0$ then (4.6) reduces to:

$$\frac{u + \alpha xy - \beta(x^2 + 1) + \gamma y}{v + \alpha(y^2 + 1) - \beta xy - \gamma x} = \frac{U}{V} \tag{4.8}$$

Equations  (4.6),  (4.7)  and  (4.8)  are  bilinear  in  the  translation  and  the 
rotation  parameters.  This  nonlinearity  makes  it  difficult  to  combine 
constraints  from  different  image  locations  to  compute  the  motion 
parameters.  To  summarize,  the  problems  with  computation  of  motion 
parameters  are: 

1.  The  constraint  equations  are  nonlinear. 

2.  The  parameter  space  is  of  high  (e.g.  five)  dimensionality. 
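
The elimination of depth and the scale ambiguity can be checked numerically. The Python sketch below generates flow from (4.5) and evaluates the cross-multiplied form of (4.6), which avoids division when the denominator vanishes. The function names are illustrative and the cross-multiplied residual is a convenience chosen here, not a formula quoted from the text.

```python
def flow(x, y, Z, U, V, W, alpha, beta, gamma):
    """Optical flow (u, v) at image point (x, y) with depth Z, per (4.5)."""
    u = (U - x * W) / Z - alpha * x * y + beta * (x * x + 1) - gamma * y
    v = (V - y * W) / Z - alpha * (y * y + 1) + beta * x * y + gamma * x
    return u, v


def constraint_residual(x, y, u, v, U, V, W, alpha, beta, gamma):
    """Cross-multiplied (4.6); zero when the parameters explain (u, v)."""
    # Remove the rotational part, leaving the translational flow.
    tu = u + alpha * x * y - beta * (x * x + 1) + gamma * y
    tv = v + alpha * (y * y + 1) - beta * x * y - gamma * x
    return tu * (V - y * W) - tv * (U - x * W)
```

The residual is linear in the translation for fixed rotation and linear in the rotation for fixed translation, which is the bilinearity noted above.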
Algorithm III: Hough transform in 5D parameter space. This type of
algorithm can be easily realized by simple parallel neuronal hardware (see
[31]).  The  parameters  that  are  to  be  determined  are  the  polar  angles  (or 
direction  cosines)  representing  the  directions  of  translation  and  rotation,  and 
the  magnitude  of  the  rotation  vector.  This  representation  for  the  rigid 
motion  parameters  is  convenient  since  the  parameter  subspaces 
representing  directions  in  space  become  easy  to  quantize  by  means  such  as 
geodesic  tessellation  of  the  gaussian  sphere.  The  steps  in  the  algorithm  are: 

1. Select a coarseness scale for the parameter subspaces. For instance, how
many distinct directions in space, the range of values estimated for the
rotation magnitude and the sampling interval in this range. Initialize
the parameter units belonging to the Hough transform space (this is the
five dimensional accumulator array where the votes for every parameter
vector are tallied).

2.  For  all  retinal  locations  where  optical  flow  has  been  measured  do  step  3: 

3.  For  all  possible  parameter  values  (i.e.  values  of  the  parameter 
quintuple)  admitted  in  step  1,  do: 

(i)  If  the  direction  of  the  translational  velocity  is  not  parallel 
to  the  image  plane  select  equation  (4.7)  else  select  equation 
(4.8). 

(ii)  If  the  parameter  values  satisfy  the  chosen  constraint 
equation  vote  for  the  corresponding  parameter  vector. 
4. Find the parameter quintuple that has received the maximum number
of votes.

5.  Restrict  the  parameter  space  to  a  neighborhood  of  the  selected 
parameter  quintuple.  Repeat  the  steps  from  2  to  4  after  choosing  a  finer 
parameter  space  quantization. 

6.  If  the  error  due  to  the  parameter  quantization  is  acceptable  then  stop 
and  return  the  parameter  values  computed.  Otherwise  repeat  step  5. 
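
The vote, select and refine loop of steps 2 to 6 can be illustrated on a deliberately reduced problem. The Python sketch below assumes zero rotation, so that (4.7) becomes linear in the two translational parameters, and searches only that two dimensional space with rectangular cells. It is not Algorithm III itself: the full algorithm votes over all five parameters and tessellates the gaussian sphere. The function name, grid size and search bounds are assumptions.

```python
def refine_foe(flows, lo=-4.0, hi=4.0, n=8, levels=5):
    """Coarse-to-fine voting for (x0, y0) from flows [(x, y, u, v), ...]."""
    cx = cy = (lo + hi) / 2.0
    half = (hi - lo) / 2.0
    for _ in range(levels):
        step = 2.0 * half / n
        h = step / 2.0
        best, best_votes = None, -1
        for i in range(n):
            for j in range(n):
                px = cx - half + (i + 0.5) * step
                py = cy - half + (j + 0.5) * step
                votes = 0
                for x, y, u, v in flows:
                    # With zero rotation (4.7) reads
                    # u*(y0 - y) - v*(x0 - x) = 0, a line in (x0, y0).
                    residual = u * (py - y) - v * (px - x)
                    reach = (abs(u) + abs(v)) * h
                    if abs(residual) <= reach:  # line crosses this cell
                        votes += 1
                if votes > best_votes:
                    best, best_votes = (px, py), votes
        cx, cy = best
        half = step  # next level: winning cell plus a half-cell margin
    return cx, cy
```

Each level shrinks the cell width by a factor of four here; the stopping rule of step 6 is replaced by a fixed number of levels.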

Some  Remarks: 

(i)  The  space  and  time  required  by  the  algorithm  is  reduced  by  periodically 
examining  the  parameter  accumulator  units  and  purging  those  that  have 
collected  only  a  few  votes  compared  to  the  top  contenders.  This  is 
possible,  since  it  is  assumed  that  the  noise  in  the  optical  flow  data  is 
uniformly  distributed  in  retinal  space. 

(ii)  The  confidence  of  the  computed  parameter  quintuple  is  the  ratio  of  the 
votes  it  received  to  the  maximum  votes  possible. 

(iii)  If  in  step  4  instead  of  a  clear  winner,  a  number  of  contenders  are  found 
then  step  5  might  have  to  be  repeated  for  each  of  these  for  finer 
resolutions.  Then  the  winner  is  the  parameter  quintuple  that  comes 
through  with  the  highest  confidence. 

Algorithm III performs well when the quantization of the parameter space
is not too coarse. Table 4.1 shows the degradation of
performance with coarser quantization. Although the motion constraint
equation is nonlinear it is actually bilinear in form. This means that if
the two translation parameters are known then the constraint becomes linear
in the three rotational parameters. The same holds true when the three
rotational parameters are known, in which case the constraint becomes linear
in terms of the two translational parameters. This fact is used to modify
Algorithm III, so that instead of voting in a five dimensional space, we
break up the parameter space into two subspaces corresponding to the
translational and rotational parameters respectively. The two subspaces are
arranged in a hierarchical fashion. Every cell in the translation space spawns
a rotational space where the linear intersection strategy is used to
accumulate votes. The method is depicted in figure 4.4. It was found that
this strategy was very robust with respect to parameter space quantization.
In fact very coarse translation as well as rotational spaces could be used and
successively refined. Plate 4.1 shows some of the results for the rotational
parameter. The displays show vote distribution on the geodesic gaussian
sphere which has been quantized at various resolutions. The quantization
parameter N denotes the number of distinct directions in space used for the

  Quantization Error     Computed Parameters                 Error
  Trans. (%)  Rot. (%)   x0     y0    alpha  beta  gamma     Trans. (%)  Rot. (%)
  10.0        8          0.44   1.4   3.0    2.5   2.7       30          20.0
   5.0        4          0.97   2.4   3.3    2.7   4.0        9           5.0
   2.5        2          1.00   2.0   3.0    2.0   4.4        6           0.5
   1.2        1          0.99   1.9   3.0    2.1   3.9        3          [illegible]

Table 4.1  Quantization effects on the five parameter Hough transform

Figure 4.4  Hough Transform with Decoupled Subspaces

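
The decoupling of the translational and rotational subspaces can be illustrated
with a small sketch. It is not the implementation used in this chapter: instead
of accumulating votes in a rotational Hough space for every translation cell,
it solves the linear rotation problem by least squares and scores each
translation cell by the residual of that fit. The flow model follows equation
(4.13) with unit focal length; the function names, the translation grid, and
the idea of ranking cells by residual are assumptions made for illustration.

```python
import numpy as np

def synthetic_flow(x, y, Z, T, omega):
    """Optical flow of a rigid scene with unit focal length (form of eq. 4.13)."""
    U, V, W = T
    a, b, g = omega
    u = (U - x * W) / Z - a * x * y + b * (x**2 + 1) - g * y
    v = (V - y * W) / Z - a * (1 + y**2) + b * x * y + g * x
    return u, v

def rotation_given_translation(x, y, u, v, x0, y0):
    """With (x0, y0) = (U/W, V/W) fixed, the constraint is linear in the rotation."""
    A = np.column_stack([
        x0 - x + x0 * y**2 - y0 * x * y,   # coefficient of alpha
        y0 - y + y0 * x**2 - x0 * x * y,   # coefficient of beta
        x**2 + y**2 - x0 * x - y0 * y,     # coefficient of gamma
    ])
    rhs = -((y * u - x * v) + x0 * v - y0 * u)
    rot, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return rot, float(np.linalg.norm(A @ rot - rhs))

def coarse_translation_search(x, y, u, v, grid):
    """Visit every translation cell; keep the cell whose rotation fit is best."""
    best = None
    for x0 in grid:
        for y0 in grid:
            rot, res = rotation_given_translation(x, y, u, v, x0, y0)
            if best is None or res < best[0]:
                best = (res, float(x0), float(y0), rot)
    return best
```

A coarse grid can be searched first and a finer grid laid around the winning
cell, which mirrors the successive refinement described above.
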
The next two algorithms approach the nonlinearity problem by
linearizing the constraint equation. The price we pay in this case is
that the dimensionality of the parameter space increases. In the following
discussion it is assumed that not all the translational velocity
components are zero. This is a valid assumption since it has been shown in
a previous section that the motion parameters for pure rotational motion are
uniquely detectable.

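From equation (4.6) we have:

$$ (yu - xv)W + vU - uV + (\alpha U + \beta V) - x(\alpha W + \gamma U) - y(\beta W + \gamma V) - xy(\alpha V + \beta U) + x^2(\beta V + \gamma W) + y^2(\alpha U + \gamma W) = 0 \qquad (4.9) $$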

Now we state and prove a lemma regarding the feasibility of computing the
motion parameters using the constraint given above.

Lemma I: The optical flow components can be expressed as an implicit
polynomial equation $F(u, v, x, y, p_i;\ i = 1, \ldots, 8) = 0$ involving the
image coordinates (x,y) and eight linearly independent parameters $p_i$,
unless the depth function is a rational function $P_1(x,y)/Q_2(x,y)$, where
$P_1$ and $Q_2$ are polynomials of first and second orders respectively.


Proof: Equation (4.9) is homogeneous in the motion parameters. Assume
that the parameter $W \neq 0$ (the case where $W = 0$ but either $U$ or
$V \neq 0$ can be worked out in an analogous manner). Dividing the above
equation by W yields:

$$ (yu - xv) + p_1 v - p_2 u + p_3 - p_4 x - p_5 y + p_6 x^2 + p_7 y^2 - p_8 xy = 0 \qquad (4.10) $$


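where $(x_0, y_0) = (U/W, V/W)$ and

$$ p_1 = x_0 \qquad (4.11a) $$
$$ p_2 = y_0 \qquad (4.11b) $$
$$ p_3 = \alpha x_0 + \beta y_0 \qquad (4.11c) $$
$$ p_4 = \alpha + \gamma x_0 \qquad (4.11d) $$
$$ p_5 = \beta + \gamma y_0 \qquad (4.11e) $$
$$ p_6 = \gamma + \beta y_0 \qquad (4.11f) $$
$$ p_7 = \gamma + \alpha x_0 \qquad (4.11g) $$
$$ p_8 = \beta x_0 + \alpha y_0 \qquad (4.11h) $$

The parameters $p_i$ are linearly dependent if and only if

$$ k_1 v - k_2 u + k_3 - k_4 x - k_5 y + k_6 x^2 + k_7 y^2 - k_8 xy = 0 \qquad (4.12) $$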

where the $k_i$'s are constants, not all of which are zero. Let the optical
flow be due to a rigid surface Z moving with velocity
$(U, V, W, \alpha, \beta, \gamma)$. In this case:

$$ u = \frac{U - xW}{Z} - \alpha xy + \beta(x^2 + 1) - \gamma y $$
$$ v = \frac{V - yW}{Z} - \alpha(y^2 + 1) + \beta xy + \gamma x \qquad (4.13) $$


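Assume that the parameters $p_i$ are linearly dependent. This implies that in
equation (4.12) there must be at least one $k_i$ that is not equal to zero.
However, if both $k_1$ and $k_2$ are zero, then equation (4.12) reduces to a
polynomial in (x,y) that vanishes identically, so all the $k_i$'s must be
zero. Hence, if the parameters $p_i$ are linearly dependent, then at least one
of $k_1$ and $k_2$ must be nonzero.

Substituting for u and v in equation (4.12) from equation (4.13) we obtain:

$$ k_1\left(\frac{U - xW}{Z} - \alpha xy + \beta(x^2 + 1) - \gamma y\right) - k_2\left(\frac{V - yW}{Z} - \alpha(y^2 + 1) + \beta xy + \gamma x\right) + k_3 - k_4 x - k_5 y + k_6 x^2 + k_7 y^2 - k_8 xy = 0 $$

Since $k_1$ and $k_2$ are not both zero, this equation can be solved for Z.
The numerator, $k_2(V - yW) - k_1(U - xW)$, is a first order polynomial in
(x,y) and the denominator is a second order polynomial, so Z is a rational
function of the form $P_1(x,y)/Q_2(x,y)$. This proves the lemma.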


Lemma II: The five parameters of rigid motion can be uniquely determined by
the parameters $p_i$.

Algorithm IV: Equation (4.10) is the basis of a Hough transform scheme to
compute the eight parameters $p_i$. Once these parameters are computed, the
five rigid motion parameters are uniquely determined. However, this algorithm
has the drawback that it requires an eight dimensional solution space. The
next algorithm seeks to remedy this problem. It is based on the assumption
that the optical flow field is available in the form of locally analytic
functions. This enables us to obtain the first order spatial derivatives of
the flow field, which are used to derive motion constraint equations.
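
The linearization behind equation (4.10) can be sketched numerically. The
following is an illustration under stated assumptions rather than the Hough
scheme of Algorithm IV: the eight parameters are estimated by ordinary least
squares from noise-free flow samples, and the five motion parameters are then
recovered from the definitions (4.11a) to (4.11h). The function names are
hypothetical, and the depth values are assumed not to be of the degenerate
form excluded by Lemma I.

```python
import numpy as np

def synthetic_flow(x, y, Z, T, omega):
    """Optical flow of a rigid scene with unit focal length (form of eq. 4.13)."""
    U, V, W = T
    a, b, g = omega
    u = (U - x * W) / Z - a * x * y + b * (x**2 + 1) - g * y
    v = (V - y * W) / Z - a * (1 + y**2) + b * x * y + g * x
    return u, v

def solve_eight_parameters(x, y, u, v):
    """Least squares estimate of p1..p8 in equation (4.10)."""
    A = np.column_stack([v, -u, np.ones_like(x), -x, -y, x**2, y**2, -x * y])
    rhs = -(y * u - x * v)
    p, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return p

def motion_from_parameters(p):
    """Recover (x0, y0) and (alpha, beta, gamma) from p1..p8 via (4.11a)-(4.11h)."""
    x0, y0 = p[0], p[1]
    # Rows express p3..p8 as linear functions of (alpha, beta, gamma).
    M = np.array([
        [x0, y0, 0.0],   # p3 = alpha*x0 + beta*y0
        [1.0, 0.0, x0],  # p4 = alpha + gamma*x0
        [0.0, 1.0, y0],  # p5 = beta + gamma*y0
        [0.0, y0, 1.0],  # p6 = gamma + beta*y0
        [x0, 0.0, 1.0],  # p7 = gamma + alpha*x0
        [y0, x0, 0.0],   # p8 = beta*x0 + alpha*y0
    ])
    rot, *_ = np.linalg.lstsq(M, p[2:], rcond=None)
    return float(x0), float(y0), rot
```

Treating the eight parameters as independent unknowns is what makes the
problem linear; the six relations for $p_3$ to $p_8$ are redundant once
$x_0$ and $y_0$ are known, which is why a least squares step is used in the
recovery.
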

Algorithm V: Differentiating equation (4.10) with respect to the retinal
space coordinates, we have two independent equations:

$$ (y u_x - v - x v_x) + p_1 v_x - p_2 u_x - p_4 + 2 p_6 x - p_8 y = 0 \qquad (4.14) $$
$$ (u + y u_y - x v_y) + p_1 v_y - p_2 u_y - p_5 + 2 p_7 y - p_8 x = 0 \qquad (4.15) $$

The parameters in equations (4.14) and (4.15) are linearly independent
when the depth function is not of the form given in Lemma I. Selecting five
suitable points we obtain two alternative sets of simultaneous equations in
five unknowns. These can then be solved for the five motion parameters.
Note, however, that when $p_1 = x_0 = 0$ equation (4.14) alone cannot
be used for the computation. This is because the parameters
$(p_1, p_2, p_4, p_6, p_8)$ cannot then be used to solve for the five motion
parameters. A similar restriction holds for equation (4.15) when
$p_2 = y_0 = 0$.


Algorithm VI: Motion parameters from structure and optical flow.

When the structure of the moving surface is known, its motion is
unambiguous. This method also reduces the dimensionality of the
parameter space by isolating the rotational parameters. Two alternative
constraint equations can be used here. In the first form, spatial derivatives
of the optical flow function are needed. This implies local analytic
reconstruction of the flow function. In the alternative form of the
constraint, depth ratios are needed, implying reliable (and dense)
measurement of surface normals.

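From eq. (4.5) the expressions for the spatial derivatives of the optical flow
(u,v) are obtained as:

$$ u_x = -\frac{W}{Z} - (x_0 - x)\frac{W}{Z^2}\frac{\partial Z}{\partial x} - \alpha y + 2\beta x \qquad (4.16.1) $$
$$ u_y = -(x_0 - x)\frac{W}{Z^2}\frac{\partial Z}{\partial y} - \alpha x - \gamma \qquad (4.16.2) $$
$$ v_x = -(y_0 - y)\frac{W}{Z^2}\frac{\partial Z}{\partial x} + \beta y + \gamma \qquad (4.16.3) $$
$$ v_y = -\frac{W}{Z} - (y_0 - y)\frac{W}{Z^2}\frac{\partial Z}{\partial y} - 2\alpha y + \beta x \qquad (4.16.4) $$

Substituting $(x_0 - x)\frac{W}{Z}$ and $(y_0 - y)\frac{W}{Z}$ in the above
equations from equation (4.5) we get:

$$ u_x - v_y = (-u - \alpha xy + \beta(x^2 + 1) - \gamma y)\psi + (v + \alpha(y^2 + 1) - \beta xy - \gamma x)\rho + \alpha y + \beta x \qquad (4.17.1) $$
$$ u_y = (-u - \alpha xy + \beta(x^2 + 1) - \gamma y)\rho - \alpha x - \gamma \qquad (4.17.2) $$
$$ v_x = (-v - \alpha(y^2 + 1) + \beta xy + \gamma x)\psi + \beta y + \gamma \qquad (4.17.3) $$

where $\psi = \frac{1}{Z}\frac{\partial Z}{\partial x}$ and
$\rho = \frac{1}{Z}\frac{\partial Z}{\partial y}$.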
Thus at every image location (x,y), a set of three linearly independent
equations involving the rotation parameters can be obtained. The functions
$\psi(x,y)$ and $\rho(x,y)$ are computable from the surface orientation values
$\partial Z/\partial X$ and $\partial Z/\partial Y$ (see Appendix B).
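
As a hedged illustration of how equations (4.17) isolate the rotation, the
sketch below rearranges the three equations into a linear system in
$(\alpha, \beta, \gamma)$ and solves it by least squares over all supplied
points. The function name and argument layout are assumptions; the flow, its
first derivatives, and the shape terms $\psi$ and $\rho$ are taken as given
arrays of equal length.

```python
import numpy as np

def rotation_from_flow_derivatives(x, y, u, v, ux, uy, vx, vy, psi, rho):
    """Solve equations (4.17.1)-(4.17.3) for (alpha, beta, gamma) by least squares."""
    # (4.17.1): u_x - v_y + u*psi - v*rho = linear form in (alpha, beta, gamma)
    rows1 = np.column_stack([
        -x * y * psi + (y**2 + 1) * rho + y,
        (x**2 + 1) * psi - x * y * rho + x,
        -y * psi - x * rho,
    ])
    rhs1 = ux - vy + u * psi - v * rho
    # (4.17.2): u_y + u*rho = linear form in (alpha, beta, gamma)
    rows2 = np.column_stack([
        -x * y * rho - x,
        (x**2 + 1) * rho,
        -y * rho - 1.0,
    ])
    rhs2 = uy + u * rho
    # (4.17.3): v_x + v*psi = linear form in (alpha, beta, gamma)
    rows3 = np.column_stack([
        -(y**2 + 1) * psi,
        x * y * psi + y,
        x * psi + 1.0,
    ])
    rhs3 = vx + v * psi
    A = np.vstack([rows1, rows2, rows3])
    b = np.concatenate([rhs1, rhs2, rhs3])
    rot, *_ = np.linalg.lstsq(A, b, rcond=None)
    return rot
```

Note that the translation does not appear in this system at all; it has been
absorbed by the measured flow values on the left hand sides.
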

When it is not possible to measure derivatives of the optical flow, but
the ratio of depths at any two image locations can be estimated, an
alternative linear constraint equation can be derived involving only the
rotation parameters. Consider two image points $(x_1, y_1)$ and $(x_2, y_2)$
with depths $Z_1$ and $Z_2$ respectively. The optical flow values at these
points are $(u_1, v_1)$ and $(u_2, v_2)$. The motion parameters are
$(U, V, W, \alpha, \beta, \gamma)$. Using equation (4.5) we have the following
equations

$$ u_1 Z_1 - u_2 Z_2 = (x_2 - x_1)W + Z_1(-\alpha x_1 y_1 + \beta(x_1^2 + 1) - \gamma y_1) - Z_2(-\alpha x_2 y_2 + \beta(x_2^2 + 1) - \gamma y_2) $$
$$ v_1 Z_1 - v_2 Z_2 = (y_2 - y_1)W + Z_1(-\alpha(y_1^2 + 1) + \beta x_1 y_1 + \gamma x_1) - Z_2(-\alpha(y_2^2 + 1) + \beta x_2 y_2 + \gamma x_2) $$

Eliminating W from the above equations we have

$$ l_{12}\,\alpha + m_{12}\,\beta + n_{12}\,\gamma + s_{12} = 0 \qquad (4.18) $$

where

$$ l_{12} = x_1 y_1 y_2 - x_2 y_1^2 + x_1 - x_2 + \frac{Z_2}{Z_1}(x_2 y_1 y_2 - x_1 y_2^2 - x_1 + x_2) $$
$$ m_{12} = x_1 x_2 y_1 - x_1^2 y_2 + y_1 - y_2 + \frac{Z_2}{Z_1}(x_1 x_2 y_2 - x_2^2 y_1 - y_1 + y_2) $$
$$ n_{12} = x_1 x_2 + y_1 y_2 - x_1^2 - y_1^2 + \frac{Z_2}{Z_1}(-x_2^2 - y_2^2 + x_1 x_2 + y_1 y_2) $$
$$ s_{12} = u_1(y_2 - y_1) - v_1(x_2 - x_1) + \frac{Z_2}{Z_1}(-u_2(y_2 - y_1) + v_2(x_2 - x_1)) $$
If the surface normal values are available everywhere in a region enclosing
two image points, then the depth ratio $Z_2/Z_1$ corresponding to those
locations can be estimated (of course, mathematically, it is possible to
compute this ratio if the surface normal values are known along a path from
one image location to the other). Consequently, each pair of image
points gives rise to a linear constraint in the rotation parameters. Thus by a
suitable choice of three pairs of image points we can uniquely solve for the
rotation parameters and subsequently the translation parameters
$(U/W, V/W)$ (see Appendix A).
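
The depth ratio constraint (4.18) can be sketched as follows. This is an
illustration only: the function names are hypothetical, the depth ratios are
assumed to be known exactly, and many point pairs are combined by least
squares rather than selecting exactly three suitable pairs as described above.

```python
import numpy as np

def depth_ratio_coefficients(x1, y1, u1, v1, x2, y2, u2, v2, r):
    """Coefficients (l, m, n, s) of equation (4.18); r is the depth ratio Z2/Z1."""
    l = x1 * y1 * y2 - x2 * y1**2 + x1 - x2 + r * (x2 * y1 * y2 - x1 * y2**2 - x1 + x2)
    m = x1 * x2 * y1 - x1**2 * y2 + y1 - y2 + r * (x1 * x2 * y2 - x2**2 * y1 - y1 + y2)
    n = x1 * x2 + y1 * y2 - x1**2 - y1**2 + r * (x1 * x2 + y1 * y2 - x2**2 - y2**2)
    s = u1 * (y2 - y1) - v1 * (x2 - x1) + r * (-u2 * (y2 - y1) + v2 * (x2 - x1))
    return l, m, n, s

def rotation_from_depth_ratios(x1, y1, u1, v1, x2, y2, u2, v2, r):
    """Stack one linear constraint per point pair and solve for (alpha, beta, gamma)."""
    l, m, n, s = depth_ratio_coefficients(x1, y1, u1, v1, x2, y2, u2, v2, r)
    A = np.column_stack([l, m, n])
    rot, *_ = np.linalg.lstsq(A, -np.asarray(s), rcond=None)
    return rot
```

Only the ratio of depths enters the coefficients, so the overall scale of the
depth map is never needed.
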

The  novel  feature  of  the  above  algorithm  is  that  it  can  combine  shape 
and  motion  information  under  two  different  conditions: 

(1)  In  the  first  case  the  optical  flow  field  has  been  measured  sufficiently 
’densely’  to  enable  local  reconstruction  of  the  flow  field.  This  enables 
the  first  order  spatial  derivatives  of  the  flow  field  to  be  estimated.  Then 
at  all  retinal  points  where  the  surface  normals  are  known,  we  can 
locally  solve  for  the  rotation  parameters  by  means  of  a  set  of  three 
linear  constraint  equations. 

(2)  Alternatively,  if  the  flow  measurements  are  not  dense,  but  the  shape 
measurements  allow  reconstruction  of  the  depth  function  (up  to  a 
constant  scale  factor),  then  again  locally  we  obtain  linear  constraints  in 
the rotation parameters (e.g. equation (4.18)).

This  means  that  in  any  image  neighborhood,  full  reconstruction  of  either 
shape  or  image  motion,  helps  to  recover  both  structure  and  motion.  The 
schematic  diagram  of  the  algorithm  is  given  in  figure  4.5. 

The implemented algorithm eliminates $\psi$ and $\rho$ from equations (4.17)
to obtain a cubic polynomial equation in the three rotation parameters. The
optical flow and its first spatial derivatives are measured, and the cubic
constraint is used to estimate the rotation parameters by the Hough transform
technique. So, although the nonlinearity remains, the dimension of the
parameter space is reduced, which reduces the size of the search space. The
effect of parameter space quantization for Algorithm VI can be seen in
table 4.2.

4.5.  Conclusions 

This chapter reported the results obtained experimentally using motion
interpretation algorithms based on the constraints developed in chapter
three. The Hough transform was chosen as the preferred scheme for
implementing the algorithms since it is implementable by simple massively
parallel architectures [31]. In the case of linear constraints, least square
error minimization methods can be applied; however, these techniques are no
longer appealing when the constraint equations are nonlinear. The Hough
transform technique extends to the nonlinear case.

  Quantization   Computed Axis         Error
  N              x     y     z         (%)
  5              .25   .47   .84        7
  3              .17   .81   .56       36
  2              .00   .67   .75       30
  1              .00   .36   .93       36

Table 4.2  Error in determination of axis of rotation

This chapter also introduced the notion of a hierarchic Hough transform
scheme where a coarse to fine refinement strategy was seen to work well with
the nonlinear constraint equations that arise in relation to rigid body motion.


Figure  4.5  An  Adaptive  algorithm  for  determining  rotation 




Chapter  Five 


Active  Navigation 

Egomotion  Perception  by  the  Tracking  Observer 

5.1.  Introduction 

The perception of rigid motion finds application in many areas. Some
of these have been mentioned previously. One of the more important ones
is the computation of egomotion parameters with the help of visual stimuli.
These parameters help in registering the observer's motion with respect to
the environment and are prerequisites to navigation. The problem that is
addressed in this chapter is termed the Visual Navigation problem. The goal
here is to devise means of computing the egomotion parameters of a moving
observer from visual data.

The Passive Navigation approach [19] deals with egomotion parameter
computation when the moving observer carries one or more optical imaging
devices which obtain time-varying imagery of the surrounding scene. The
computation usually assumes that image motion (e.g. optical flow) has been
computed previously, and is available as input to the perceptual process.
Recently, there have been attempts to do "direct computation" [64] or
"motion without correspondence" [3] based on restricted situations like
planar moving surfaces or purely translational motion. These approaches are
novel, and certainly merit attention, but are of a preliminary nature and
need further study. Egomotion perception under the monocular passive
technique is handicapped by nonlinear constraint equations. In addition, the
difficulty of the computational task is compounded by the fairly large
number of unknown parameters to be determined. Therefore, since the
equations cannot be decoupled or simplified, iterative search techniques or
parameter space histogramming (Hough transform) have to be used in the
parameter determination. In this chapter, an alternative approach to
computing egomotion is proposed. This technique requires the moving
observer to visually track an environmental feature. Our term for this
proposed method of egomotion perception is Active Navigation. It must be
clarified, however, that the sensing method used is not active (e.g. laser or
ultrasound ranging), but the perceptual system operates in a closed loop
fashion with active feedback from the image motion computation module.
The various advantages of the method are discussed. The geometry of this
particular situation is analyzed to outline how closed form solutions for the
parameters may be obtained.

The strategy advocated in this chapter calls for visual tracking and
combination  of  information  from  stereo  image  pairs.  This  approach  is 
adopted  based  on  a  number  of  experimental  simulation  studies  reported  in 
the  previous  chapter  and  also  in  [28,  84],  which  indicate  the  difficulty  of  the 
passive  monocular  approach.  The  stereo  motion  approach  is  also  under 
investigation  elsewhere  [50,  91],  and  the  possible  employment  of  active 
tracking  to  facilitate  navigation  has  been  suggested  by  visual  psychologists 
[22]. 


It  will  be  shown  that  when  the  observer  is  able  to  track  a  prominent 
feature  point  in  the  imaged  scene,  the  task  of  navigation  is  facilitated  since 
it  is  easier  to  compute  egomotion  parameters,  compared  to  the  non-tracking 
case.  The  emphasis  in  this  chapter  is  on  the  mathematics  governing  the 
imaging  equations  that  are  obtained  while  the  system  is  tracking.  To  track, 
the  system  must  have  some  way  of  measuring  the  error  in  the  retinal  signal. 
Ways  of  doing  this  are  discussed  in  section  5.2. 

The  outline  of  this  chapter  is  as  follows: 

1.  Error velocity measurement to correct tracking drift is discussed in light
of the primate pursuit system.

2.  A general form of the relation between 3D velocity parameters and
retinal optical flow is derived. In previous derivations of this relation
[52] the origins of the body centered coordinate frame and viewer
centered coordinate frame are taken to coincide at the instant of
measurement. Using the general representation it is shown why a
monocular observer, who is able to track an environmental feature
point, has to contend with a smaller number of velocity parameters.

3.  The  analysis  extends  naturally  to  stereo  imaging  situations,  where  it  is 
shown  that,  by  combining  measurements  from  both  eyes,  a  linear 
equation  in  two  unknowns  is  obtained. 

4.  The  above  constraint  is  applied  with  all  possible  stereo  correspondences 
in  a  small  neighborhood,  so  as  to  minimize  the  square  error.  This  least 
square  error  technique  is  seen  to  work  well  on  simulated  data,  even 
with  the  addition  of  10-20  percent  noise. 

5.  A new set of constraint equations is derived for the tracking observer,
which allows closed form solution of the egomotion parameters.
Simulation  results  are  described  and  implementational  issues  for 
integrating  this  module  in  the  overall  motion  interpretation  scheme  are 
discussed. 

5.2.  Target selection via Velocity Channels

The key assumption is that the alignment of the camera axes is
controllable by the system itself. In this case, as the system moves in the
world, the orientation of the camera is continually adjusted. This
adjustment is dependent upon the two dimensional motion perceived on the

Figure 5.1  The Tracking Mechanism


In the tracking system the problem can be stated as follows: given the image
of a target environmental point, generate control signals that will foveate
the target. The block diagram of a system for accomplishing this can be
schematized as shown in figure 5.1. The first and most important point to
make is that the system can be adequately modeled by servomechanism
concepts. It is relatively easy to see how to generate the kinds of motor
commands for the two movement systems to produce the observed
behavior. This of course assumes that the target point is identified.

Target identification is a central issue: in a complicated motion field,
how can the target velocities be easily identified? This is a basic
subproblem in tracking using velocity sensing and is captured by figure 5.2.

Our answer to this question uses the notion of global flow field vectors.
Such vectors respond to velocities in every part of the optical flow field. In
other words, if we visualize the optic flow field as a four dimensional
parameter space (x, y, u(x,y), v(x,y)), the global flow field sums all the
different flow vectors in a two dimensional (u,v) parameter space.
Detectors form a distinct set whose sensitivities are organized into channels.
In the case of a particular flow field, some channels will typically respond to
it and others will not. Figure 5.3 shows how the channel concept can be
utilized.

The tracking system must use velocities that stem from the object being
tracked and ignore background velocities. (a) shows an initial situation
where a target is moving in the retina. (b) Once the tracking system is
engaged, the target is moving with a relative velocity near zero but the
background has a large signal.

Figure 5.2  Target Identification


We claim that with this abstract flow channel model the problem
becomes one of determining which of the channels should be used for the
eye movement control system. This means that a mechanism is needed to
switch the appropriate channels into the servo system. Note that this
technique uses a spatially distributed detector array. Our contention is that
it is appropriate to average the flow field over this subset.

The global velocity space registers the number of flow vectors with
certain values. Channels, shown in the figure as concentric annular
regions, allow ranges of velocities to be selectively ignored. (a)
Initially the low velocity channel is off (shaded), allowing the system
to selectively register a moving target (•) and ignore background
variations (o). (b) Once the tracking mechanism is activated the high
velocity channels are blocked and again the target velocities are
passed, ignoring the background signals.

Figure 5.3  Concept of the Velocity Channel

A mechanism to switch the detectors on once the appropriate ones have
been identified is simple to understand, so we will concentrate on
identifying the ideas behind selecting the right detectors. The general way
that this is done is by a feed-forward mechanism that determines some
selection criterion. The different kinds of criteria are important, so it is
useful to categorize them.

1.  Extrinsic  features.  This  method  uses  some  other  feature,  say  color,  that 
also  has  spatially  organized  detectors.  To  track  a  red  object,  the 
detectors  that  register  red  are  used  to  select  the  spatial  component  of 
the  velocity  detectors.  All  such  detectors  with  the  appropriate 
correspondence  are  used. 

2.  Intrinsic  Features.  This  method  uses  some  particular  range  of  values  for 
the  flow  field  itself,  say  all  values  over  a  certain  velocity  magnitude.  To 
track  an  object,  all  the  detectors  that  satisfy  the  intrinsic  criterion  are 
switched  into  the  movement  control  system. 

These  distinctions  are  important  as  they  correspond  to  two  different 
types  of  tracking  situations.  In  navigation,  where  the  entire  spatial  field  is 
moving,  an  extrinsic  feature  is  appropriate.  In  pursuing  a  small  target,  that 
target  is  usually  moving  differently  with  respect  to  the  background,  so  an 
intrinsic  feature  may  be  appropriate. 
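
The intrinsic selection criterion can be sketched as a speed band applied to
the flow vectors, followed by averaging over the selected subset. This is a
minimal sketch under stated assumptions: a channel is modeled as an annulus
lo <= speed < hi in (u,v) space, and the function name and thresholds are
hypothetical rather than taken from the system described here.

```python
import numpy as np

def channel_average(u, v, lo, hi):
    """Average the flow vectors whose speed falls in the channel [lo, hi).

    Returns the mean (u, v) of the selected vectors and the boolean mask
    of selected detectors, or (None, mask) when the channel is empty.
    """
    speed = np.hypot(u, v)
    mask = (speed >= lo) & (speed < hi)
    if not mask.any():
        return None, mask
    return (float(u[mask].mean()), float(v[mask].mean())), mask
```

Before tracking is engaged, a high speed band would pick out a moving target
against a slow background; once the target is held near zero retinal velocity,
the same function with a low speed band selects the target and rejects the
background.
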

5.3.  Measuring Egomotion
5.3.1.  Background 

Consider  first,  the  monocular  imaging  situation  where  a  sensor  is 
moving  relative  to  a  static  scene.  The  co-ordinate  frame  (X,Y,Z)  is  fixed  to 
the  sensor  (see  figure  5.4).  The  viewing  direction  is  along  the  positive  Z- 
axis. 

The analysis presented here assumes a rotating and translating observer
moving in a static environment. However, since the velocity parameters that
characterize the motion are all relative to the observer's frame of reference,
the analysis per se is not affected by multiple moving objects. The analysis
assumes the velocity representation for the motion parameters.

The reference coordinate frame is fixed to the observer. There is
another coordinate frame fixed at the point 'S' on the body (see figure 5.4).
The point S has the velocity $T_s = (U_s, V_s, W_s)$. At the time of
observation the reference and the body frame axes are parallel to each other.
The rotational velocity of the body is given by the vector
$\Omega = (\alpha, \beta, \gamma)$. The 3D velocity of a point
$P = (X, Y, Z)$ on the body is given by the equation

$$ \dot{\mathbf{X}} = T_s + [\Omega](\mathbf{X} - \mathbf{X}_s) \qquad (5.1) $$

where $\mathbf{X}_s = (X_s, Y_s, Z_s)$ denotes the position of the body origin
'S', and $\dot{\mathbf{X}}$ denotes the 3D velocity of P (the 'dot' operator is
used throughout to signify differentiation with respect to time). Also

$$ [\Omega] = \begin{bmatrix} 0 & -\gamma & \beta \\ \gamma & 0 & -\alpha \\ -\beta & \alpha & 0 \end{bmatrix} $$

The image formation is modeled by the perspective projection model (see
chapter three). The projection of a point $P = (X, Y, Z)$ is denoted by
$p = (x, y)$. The projective relation is

$$ x = \frac{fX}{Z}, \qquad y = \frac{fY}{Z} \qquad (5.2) $$

The constant f is the focal length of the imaging system. It is the distance
separating the nodal point of the camera (or eye) and the image plane,
measured along the optical axis (i.e. Z axis). In subsequent steps the
constant f is assumed to be unity. The velocity of image points in the 2D
image space is called optical flow. The relations between the 2D and 3D
velocities are obtained by differentiating equation (5.2) and substituting
from equation (5.1):


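$$ u = \dot{x} = \frac{U_s - xW_s}{Z} - \alpha x\left[y - \frac{Y_s}{Z}\right] + \beta\left[1 + x^2 - \frac{Z_s + xX_s}{Z}\right] - \gamma\left[y - \frac{Y_s}{Z}\right] \qquad (5.3.1) $$
$$ v = \dot{y} = \frac{V_s - yW_s}{Z} - \alpha\left[1 + y^2 - \frac{Z_s + yY_s}{Z}\right] + \beta\left[xy - \frac{yX_s}{Z}\right] + \gamma\left[x - \frac{X_s}{Z}\right] \qquad (5.3.2) $$

When the origin of the body coordinate frame coincides with the reference
or observer coordinate frame, then $X_s = Y_s = Z_s = 0$ and
$T = T_0 = (U, V, W)$, which simplifies the equation for optical flow to give:

$$ u = \frac{U - xW}{Z} - \alpha xy + \beta(x^2 + 1) - \gamma y \qquad (5.4.1) $$
$$ v = \frac{V - yW}{Z} - \alpha(1 + y^2) + \beta xy + \gamma x \qquad (5.4.2) $$
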
Figure 5.4  Imaging Geometry and motion representation


The above pair of equations embodies the constraint that the optical flow
(u,v) imposes upon the parameters of rigid motion. Thus all an observer
has to do to determine where he is going is to measure the retinal velocity
pattern and then use the above pair of equations, applied at no fewer than
five points [13, 67, 84], to determine the 3D velocity of egomotion. Note
that there are six velocity components (i.e. three for translation and three
for rotation). Unfortunately, however, not all six parameters can be
computed from monocular visual data. This is because of the depth term 'Z'
that occurs in the above pair of equations. The depth introduces a scaling
effect, whereby other things being equal, multiplying the translational
components and the depth by the same constant factor leaves the perceived
retinal motion unchanged. Thus, for example, an object at a certain distance,
translating with a certain speed, generates the same optical flow field when it
is twice as far away and traveling in the same direction with twice the speed.
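
The scaling effect can be checked numerically with a few lines. The sketch
below evaluates the flow equations (5.4.1) and (5.4.2) with unit focal length;
the function name and the sample values are assumptions chosen for
illustration.

```python
import numpy as np

def optical_flow(x, y, Z, T, omega):
    """Optical flow (u, v) of equations (5.4.1) and (5.4.2), unit focal length."""
    U, V, W = T
    a, b, g = omega
    u = (U - x * W) / Z - a * x * y + b * (x**2 + 1) - g * y
    v = (V - y * W) / Z - a * (1 + y**2) + b * x * y + g * x
    return u, v

def scaled_scene_flow(x, y, Z, T, omega, k):
    """Flow when both the depth and the translation are multiplied by k."""
    return optical_flow(x, y, k * Z, tuple(k * t for t in T), omega)
```

Because the translation only ever appears divided by the depth, the two
functions return the same flow for any nonzero k, while the rotational terms
are unaffected by the scaling.
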

Thus the monocular observer, lacking depth information, must
eliminate the depth factor from the optical flow constraints. This will then
imply that the observer's translation can only be determined up to a scale
factor. Thus the number of egomotion parameters of interest is five,
pertaining to the direction of translation and the rotation.

When the depth variable is eliminated from the above equations we have

$$ \frac{x_0 - x}{y_0 - y} = \frac{u + \alpha xy - \beta(x^2 + 1) + \gamma y}{v + \alpha(y^2 + 1) - \beta xy - \gamma x} \qquad (5.5) $$

where $(x_0, y_0) = (U/W, V/W)$ represents the direction of translation of the
observer's coordinate frame.

The above constraint equation demonstrates the difficulty of motion
computation for a monocular observer. It is nonlinear as well as of high
dimensionality, and these two properties in conjunction make the problem
difficult [13, 28, 52, 53, 70, 84].

5.3.2.  The  tracking  Advantage 

It will now be shown that if the monocular observer can discern a
distinguishing feature or mark on the observed surface, then the perception
problem becomes simpler. Suppose that the surface in view has an easily
distinguishable and localized feature at point 'S' whose corresponding image
location is $(x_s, y_s)$. In this case we can shift the body origin to the
point S and rewrite the optical flow equations as in (5.3). In addition

$$ u_s = \frac{U_s - x_s W_s}{Z_s}, \qquad v_s = \frac{V_s - y_s W_s}{Z_s} \qquad (5.6) $$

Combining equations (5.6) and (5.3) one obtains:

$$ u = \frac{u_s + (x_s - x)W_s' + \alpha x y_s - \beta(1 + x x_s) + \gamma y_s}{Z'} - \alpha xy + \beta(1 + x^2) - \gamma y \qquad (5.7.1) $$
$$ v = \frac{v_s + (y_s - y)W_s' + \alpha(1 + y y_s) - \beta x_s y - \gamma x_s}{Z'} - \alpha(1 + y^2) + \beta xy + \gamma x \qquad (5.7.2) $$

where the 'prime' operator signifies scaling by $Z_s$, i.e.
$W_s' = W_s / Z_s$. Note that the translational parameters with respect to the
observer's frame (i.e. the observer's actual translation) are related to the
body centered translational parameters by

$$ U' = U_s' - \beta + \gamma y_s $$
$$ V' = V_s' + \alpha - \gamma x_s \qquad (5.8) $$
$$ W' = W_s' - \alpha y_s + \beta x_s $$

The  above  analysis  illustrates  the  fact  that  given  the  ability  to  estimate  the 
projected  velocity  of  a  localized  feature  accurately,  the  constraint  equations 
reduce  in  dimensionality  by  one. 

A similar result may be obtained, as can be expected, when the moving
observer is able to track a single feature point so that it appears stationary
on the retina at position (0,0). In this case we assume that the tracking
motion consists of rotations about the axes that are orthogonal to the line of
sight or the optical axis of the lens. The tracking motion is a rotation
$(\omega_x, \omega_y, 0)$, which is superimposed upon the actual parameters of
motion.

Let $S = (0, 0, Z_0)$ be the spatial coordinates of the point being tracked.
Assume that the observer can track an environmental point and hold it
steady on the optical axis (Z axis). Therefore the optical flow field will
have a singularity at the origin of the retinal frame, where the flow value is
zero. At the time of observation, the tracked point tends to move along the
observer's optical axis (figure 5.5).

Consider  an  observer  moving  with  translation  (U,V,W)  and  rotation 
(a,£,Tr).  Then,  if  the  body  frame  origin  is  taken  to  be  at  S,  from  equation 
(5.8),  remembering  that  U,  =V,  =0: 

W  =~  =-  B 
Zq 

V*  —  —  —A  (5-8) 

Z  0 

w,  =  w 

furthermore  the  optical  flow  equation  (5.3)  becomes: 


-  Axy  +  B\  1 J-  +  r*]  -  7 y 

~z~  ~  ~  ~t  +  +  Bzy  +  11 


(5.10) 


where  A  —a  +  w,  and  B  =0  +  u>,. 


Figure  5.6  Monocular  Tracking 


Eliminating $Z$ from the above we have:

$$\frac{u + Axy - B(x^2 + 1) + \gamma y}{v + A(1 + y^2) - Bxy - \gamma x} = -\frac{B + xW'}{A - yW'} \tag{5.11}$$

where $W' = W / Z_0$.


The constraint equations derived above are similar in form to equation (5.5). However, in this case the dimensionality of the parameter space has been reduced from five to four without increase in the degree of nonlinearity of the constraint. It is important to note that the observer can determine his direction of translation since from equation (5.9) we have

Thus even without explicitly measuring his tracking motion, the observer can determine the scaled translation $(U', V', W')$. We next examine the constraint equation (5.10) and show how it may be used to actively compute the direction of translation.

5.4.  Stereo  tracking 

It can be expected that stereoscopic viewing can simplify the task of motion perception. A binocular imaging system does introduce a new complication in that, in addition to the task of retinal motion estimation, one must also accomplish stereo fusion. However, stereo fusion is a simpler task than optical flow estimation, and a recently published algorithm is reportedly able to handle this task reasonably satisfactorily [69]. The advantages of stereo imaging for analyzing motion are:

1.  The  motion  constraint  equation  is  linearized. 

2.  Tracking  an  environmental  feature  point  greatly  simplifies  motion 
computation  under  stereo  imaging  conditions. 

3.  The  epipolar  constraint  is  a  powerful  aid  in  handling  the 
“correspondence  problem”  for  both  stereo  fusion  as  well  as  retinal 
motion estimation. (In this section the reason for this will be sketched, but it will not be detailed.)

5.4.1. Tracking in Stereo with Parallel Camera Axes

The monocular imaging geometry described previously is augmented by two other coordinate frames located at the points $(d, 0, 0)$ (the left eye frame) and $(-d, 0, 0)$ (the right eye frame) respectively. The central frame can be imagined as the "head frame" and the two other frames as the camera or "eye" frames. The situation is depicted in figure 5.6. In this scheme there is no vergence between the two eye frames (rather the eyes verge at infinity). This means that the corresponding axes of all the coordinate frames are parallel. Furthermore, it is assumed that the frames are rigidly attached to each other.

The tracking action is with respect to the head frame. Now if the head frame is tracking a feature point $S_h$ at $(0, 0, p)$ then its images on the left and right eye frames are at $(-e, 0, 0)$ and $(e, 0, 0)$ respectively. The relation between the depth $p$ and $e$ is

$$p = \frac{2d}{2e} = \frac{d}{e}$$

Once again, for simplicity of explanation, consider the relative motion between the observer's head frame and the observed rigid scene as due to egomotion. The motion parameters are the translational velocity $(U, V, W)$ and the rotational velocity $(\alpha, \beta, \gamma)$. The observer's tracking movement is confined, as before, to the rotation $(\omega_x, \omega_y, 0)$ with respect to the head frame. The tracking motions executed by the eye frames include this rotation plus translations in depth of $-d\omega_y$ and $d\omega_y$ respectively.

Consider an image location $(x_l, y)$ in the left frame, and its stereo pair $(x_r, y)$ in the right frame. The disparity is given by

$$\delta = x_r - x_l = \frac{2d}{Z}$$

where $Z$ is the depth of the point in space giving rise to the stereo pair.
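The disparity relation can be inverted to recover depth from a stereo pair. The following is a minimal sketch, assuming unit focal length and the parallel-axis geometry described above; the function name `depth_from_disparity` and its argument names are illustrative and do not appear in the original.

```python
def depth_from_disparity(x_left, x_right, half_baseline):
    """Depth Z of a scene point from its stereo pair.

    Inverts delta = x_r - x_l = 2d / Z, where d (half_baseline) is the
    distance of each eye frame from the head frame and the focal length
    is assumed to be unity.
    """
    disparity = x_right - x_left
    if disparity == 0:
        # Zero disparity corresponds to a point at infinity.
        raise ValueError("zero disparity: depth is unbounded")
    return 2.0 * half_baseline / disparity


# Usage: a point at X = 0.5, Z = 4 seen by eyes at +d and -d, with d = 0.1.
d, X, Z = 0.1, 0.5, 4.0
x_l = (X - d) / Z   # left eye frame has its origin at (d, 0, 0)
x_r = (X + d) / Z   # right eye frame has its origin at (-d, 0, 0)
recovered = depth_from_disparity(x_l, x_r, d)
```

The recovered value should agree with the depth used to generate the pair, up to floating point error.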

The motion parameters are as before $(U, V, W)$ and $(\alpha, \beta, \gamma)$, with respect to a hypothetical head frame located between the two stereo coordinate frames. The head frame is assumed to track the environmental feature $S_h$ (the subscript refers to the fact that the nomenclature is with respect to the head frame). Therefore equations (2.9) hold. The motion parameters with respect to the stereo frames are:

$$L: \quad T_l = T_l^{b} + T_l^{tr}, \qquad \Omega_l = \Omega_l^{b} + \Omega_l^{tr}$$
$$R: \quad T_r = T_r^{b} + T_r^{tr}, \qquad \Omega_r = \Omega_r^{b} + \Omega_r^{tr}$$

where the subscripts $l$ or $r$ refer to the left or right frames, and the superscripts $b$ or $tr$ refer to body parameters (representing actual motion) and motion induced by the tracking motion respectively. These components can be expanded to

$$T_l^{b} = (U,\; V + \gamma d,\; W - \beta d), \qquad \Omega_l^{b} = (\alpha, \beta, \gamma)$$
$$T_r^{b} = (U,\; V - \gamma d,\; W + \beta d), \qquad \Omega_r^{b} = (\alpha, \beta, \gamma)$$
$$T_l^{tr} = (0,\; 0,\; -\omega_y d), \qquad \Omega_l^{tr} = (\omega_x, \omega_y, 0)$$
$$T_r^{tr} = (0,\; 0,\; \omega_y d), \qquad \Omega_r^{tr} = (\omega_x, \omega_y, 0)$$


It can be seen that the motion of the tracked point is given as $T_s = (0, 0, W)$ in both the left and right frames. The rotation of these frames is also the same, namely $(A, B, \gamma)$. Finally, the tracked point is located with respect to the two frames as $S_l = (-d, 0, p)$ and $S_r = (d, 0, p)$. Therefore from equation (2.3) we get the optical flow constraints for the left eye as:


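$$u_l = -\frac{Bp + x_l (W - Bd)}{Z} - A x_l y + B(1 + x_l^2) - \gamma y$$
$$v_l = \frac{Ap + \gamma d - y (W - Bd)}{Z} - A(1 + y^2) + B x_l y + \gamma x_l$$

where $A = \alpha + \omega_x$ and $B = \beta + \omega_y$. In the above equation, making the substitution $Z = \frac{2d}{x_r - x_l}$, we have

$$u_l = x_l \phi + B\left(1 - \frac{p}{Z}\right) - \gamma y$$
$$v_l = y \phi - A\left(1 - \frac{p}{Z}\right) + \gamma\,\frac{x_l + x_r}{2} \tag{5.12}$$

where $\phi = -\frac{W}{Z} - Ay + B\,\frac{x_l + x_r}{2}$. Similarly the optical flow for the right eye is given by:

$$u_r = x_r \phi + B\left(1 - \frac{p}{Z}\right) - \gamma y$$
$$v_r = y \phi - A\left(1 - \frac{p}{Z}\right) + \gamma\,\frac{x_l + x_r}{2} \tag{5.13}$$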


From the above equations we get:

$$\phi = \frac{u_r - u_l}{x_r - x_l}$$

which leads to a constraint equation in two parameters:

$$v_l - y\,\frac{u_r - u_l}{x_r - x_l} = -A\left(1 - \frac{p}{Z}\right) + \gamma\,\frac{x_l + x_r}{2} \tag{5.14}$$

Thus with stereo tracking it is possible to obtain a linear constraint in two unknowns at every point of measurement.
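The linearity claimed for the stereo tracking constraint can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the optical flow equations of Appendix A with unit focal length, takes the left and right frame translations to be $(U, V + \gamma d, W - Bd)$ and $(U, V - \gamma d, W + Bd)$ with rotation $(A, B, \gamma)$, applies the tracking condition $U = -Bp$, $V = Ap$, and reads constraint (5.14) as $v_l - y(u_r - u_l)/(x_r - x_l) = -A(1 - p/Z) + \gamma (x_l + x_r)/2$. The function names are hypothetical.

```python
def flow(x, y, Z, T, Omega):
    """Optical flow (u, v) at image point (x, y) with depth Z, unit focal length."""
    U, V, W = T
    a, b, g = Omega
    u = (U - x * W) / Z - a * x * y + b * (x * x + 1.0) - g * y
    v = (V - y * W) / Z - a * (y * y + 1.0) + b * x * y + g * x
    return u, v


def constraint_residual(X, Y, Z, d, p, A, B, gamma, W):
    """Left side minus right side of the assumed form of constraint (5.14).

    (X, Y, Z) is a scene point in head-frame coordinates, d the half
    baseline and p the depth of the tracked point.
    """
    # Tracking condition in the head frame (tracked point stays on the axis).
    U, V = -B * p, A * p
    # Image coordinates in the left (origin at +d) and right (origin at -d) frames.
    x_l, x_r, y = (X - d) / Z, (X + d) / Z, Y / Z
    u_l, v_l = flow(x_l, y, Z, (U, V + gamma * d, W - B * d), (A, B, gamma))
    u_r, _ = flow(x_r, y, Z, (U, V - gamma * d, W + B * d), (A, B, gamma))
    phi = (u_r - u_l) / (x_r - x_l)
    lhs = v_l - y * phi
    rhs = -A * (1.0 - p / Z) + gamma * (x_l + x_r) / 2.0
    return lhs - rhs


residual = constraint_residual(0.7, -0.4, 6.0, 0.1, 5.0, 0.02, -0.03, 0.05, 0.3)
```

Under these assumptions the residual vanishes up to rounding, so each measured point contributes one equation that is linear in $A$ and $\gamma$.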


5.4.2. Tracking with Convergent Stereo Imaging


In this case the optical axes of the two cameras converge onto a point in the environment that is being tracked. The geometry is illustrated in figure 5.7. We will generally deal with the left coordinate frame, with respect to which the various quantities will be written as in the monocular case. When we need to reference the quantities with respect to the right frame they will be primed, e.g. $x'$. The tracking motion [...]

$$\ldots - 2(\dot\theta + \dot\theta')\cot(\theta + \theta')\,\dot F \tag{5.16}$$

Let the motion of the observer be described by the translational velocity $T = (U, V, W)$ and rotational velocity $\Omega = (\alpha, \beta, \gamma)$. These parameters are defined with respect to the point $L$ in the body, which also happens to be the origin of the left coordinate frame. The tracking motion of the system consists of three independent rotations with respect to the observer. These three rotations correspond to the three motors in figure 5.1. The angular velocity $\omega$ corresponds to the rotation of the plane $PRL$ about the axis $LR$. The other two angular velocities are $\dot\theta$ and $\dot\theta'$, which affect the left and right coordinate frames respectively. Let the sense of $\omega$ be positive in the direction from $L$ to $R$. Then the tracking angular motion of the left frame with respect to the observer is given by $\omega_l$, and that of the right frame is $\omega_r'$. Note that we will express all the motion parameters measured in a frame with respect to basis vectors defined in that frame. Therefore:

$$\omega_l = (-\omega\sin\theta,\; 0,\; \omega\cos\theta)$$
$$\omega_r' = (-\omega\sin\theta',\; 0,\; -\omega\cos\theta')$$

If the rigid motion parameters with respect to the right coordinate frame are given by the translational velocity $T'$ and the rotational velocity $\Omega'$ then we have:

$$T' = R_\lambda * (T + \Omega \times \vec{LR})$$
$$\Omega' = R_\lambda * \Omega$$

where '$\times$' denotes the vector product and '$*$' denotes matrix multiplication. In addition, the rotation matrix $R_\lambda$ expresses the transformation due to the rotation by $\lambda = \pi - (\theta + \theta')$ between the left and right frames and is

$$R_\lambda = \begin{pmatrix} \cos\lambda & 0 & -\sin\lambda \\ 0 & 1 & 0 \\ \sin\lambda & 0 & \cos\lambda \end{pmatrix}$$

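Now from equation (5.8) we have:

$$U + (\beta + \omega_y) F(t) = 0 \tag{5.17}$$
$$V - (\alpha + \omega_x) F(t) = 0$$

Observe that the above equations involve five unknown motion parameters. If we now differentiate these equations we have

$$\dot F(t) = W$$
$$\dot U + \dot\beta F(t) + \beta \dot F(t) + \dot\omega_y F(t) + \omega_y \dot F(t) = 0 \tag{5.18}$$
$$\dot V - \dot\alpha F(t) - \alpha \dot F(t) - \dot\omega_x F(t) - \omega_x \dot F(t) = 0$$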

Although here we consider a rectilinearly moving observer, the translational velocity $(U, V, W)$ undergoes change due to the rotation of the frame in which the observations are made. Thus we obtain:

$$\dot U = (\beta + \omega_y) W - (\gamma + \omega_z) V$$
$$\dot V = -(\alpha + \omega_x) W + (\gamma + \omega_z) U$$
$$\dot W = (\alpha + \omega_x) V - (\beta + \omega_y) U$$

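Similarly the rotational velocity $(\alpha, \beta, \gamma)$ undergoes change due to the tracking motion, as follows:

$$\dot\alpha = \omega_y \gamma - \omega_z \beta$$
$$\dot\beta = -\omega_x \gamma + \omega_z \alpha$$
$$\dot\gamma = 0$$

Introducing the parameters $A = \alpha + \omega_x$, $B = \beta + \omega_y$ and $C = \gamma + \omega_z$, substituting for $\dot U$, $\dot V$, $\dot W$, $\dot\alpha$ and $\dot\beta$ from the above relations, and replacing $U$ and $V$ from (5.17), we have, from the last two equations in (5.18)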

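$$C = \frac{2B\dot F(t) + A\omega_z F(t) + \dot\omega_y F(t)}{(A + \omega_x) F(t)}$$

and

$$C = \frac{-2A\dot F(t) + B\omega_z F(t) - \dot\omega_x F(t)}{(B + \omega_y) F(t)}$$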

Finally eliminating $C$ from the above pair of equations and using the remaining equation of (5.18) we obtain the pair of independent equations:

$$A^2 + B^2 = \frac{\ddot F(t)}{F(t)} \tag{5.19.1}$$
$$\psi_1 A + \psi_2 B + \psi_3 = 0 \tag{5.19.2}$$

where

$$\psi_1 = 2\omega_x F(t)\dot F(t) + \dot\omega_x F^2(t) + \omega_y\omega_z F^2(t)$$
$$\psi_2 = 2\omega_y F(t)\dot F(t) + \dot\omega_y F^2(t) - \omega_x\omega_z F^2(t)$$
$$\psi_3 = (\omega_x\dot\omega_x + \omega_y\dot\omega_y) F^2(t) + 2\dot F(t)\ddot F(t)$$

From equation (5.19) we obtain two sets of solutions for the motion parameters. Eliminating the parameter $B$ we have:

$$aA^2 + bA + c = 0 \tag{5.20}$$

where $a = \psi_1^2 + \psi_2^2$, $b = 2\psi_1\psi_3$ and $c = \psi_3^2 - \psi_2^2\,\ddot F(t)/F(t)$.

To  summarize,  the  solution  method  consists  of  obtaining  the  solutions  to 
the  pair  of  equations  (5.19.1)  and  (5.19.2).  Since  closed  form  solutions  are 
obtained  at  every  time  instant  and  assuming  the  computation  errors  to  be 
uniformly  random,  we  perform  smoothing  on  the  time  series  of  the 
computed  parameters,  to  eliminate  a  large  portion  of  the  error. 
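The closed form step can be sketched in a few lines. This is an illustration under stated assumptions, not the program used for the experiments: it takes equation (5.19.1) to be $A^2 + B^2 = \ddot F / F$ and equation (5.19.2) to be a linear relation $\psi_1 A + \psi_2 B + \psi_3 = 0$, then eliminates $B$ to obtain a quadratic in $A$. The names `psi1`, `psi2`, `psi3`, `ratio` and `solve_tracking_rotation` are hypothetical.

```python
import math


def solve_tracking_rotation(psi1, psi2, psi3, ratio):
    """Return the (A, B) pairs satisfying
         A**2 + B**2 = ratio              (ratio stands for F''/F)
         psi1*A + psi2*B + psi3 = 0
    There are at most two real solutions; an empty list means none.
    """
    if psi2 == 0:
        raise ValueError("psi2 must be nonzero to eliminate B")
    # Substitute B = -(psi1*A + psi3)/psi2 into the first equation.
    a = psi1 ** 2 + psi2 ** 2
    b = 2.0 * psi1 * psi3
    c = psi3 ** 2 - psi2 ** 2 * ratio
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return []
    root = math.sqrt(disc)
    solutions = []
    for sign in (1.0, -1.0):
        A = (-b + sign * root) / (2.0 * a)
        B = -(psi1 * A + psi3) / psi2
        solutions.append((A, B))
    return solutions


# Usage: build coefficients from a known (A, B) and recover both candidates.
A_true, B_true = 0.03, -0.02
psi1, psi2 = 1.5, 2.0
psi3 = -(psi1 * A_true + psi2 * B_true)
candidates = solve_tracking_rotation(psi1, psi2, psi3, A_true ** 2 + B_true ** 2)
```

One of the two candidates reproduces the generating values; the other is the spurious interpretation that has to be rejected by observing the solutions over time.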

The important aspects of this method of computation of the motion parameters are as follows:

(a) The solution is closed form, requiring no iteration or search.

(b) The constraints are derived from the observed tracking velocities and rotations. We do not need the optical flow measurements.

(c) Here the observables are the angular positions, velocities and accelerations of the tracking motors. These can be measured quite accurately by analog measurement apparatus. This possibility forms a strong motivation for the tracking approach.

(d) The optical flow field, in our motion perception scheme, is only used to disambiguate between the possible interpretations computed by the tracking module. This is always possible since under extended periods of observation the optical flow field generated is compatible with one and only one interpretation.

5.5. Experiments

5.5.1.  Stereo  with  parallel  camera  axes 

We have carried out some preliminary experiments on artificial images to date. Assuming binocular vision and tracking we obtain $A$ and $\gamma$, from which we can recover the other parameters.

The  experiments  were  performed  under  certain  assumptions: 

(a) The optical flow is known at each point.

(b)  There  are  a  reasonable  number  of  points  in  the  vicinity  of  the  tracked 
point. 

(c)  The  translational  velocity  parameter  along  the  camera  axis  (i.e.  Z  axis) 
is  small  compared  to  the  average  scene  depth. 


The algorithm used to recover $A$ and $\gamma$ is as follows.

(a) Obtain possible stereo correspondences by epipolar constraints, i.e. the difference in the y values in the two cameras' image frames has to lie within a certain value which we shall call the radius.

(b) Calculate the depth of the point from the correspondence.

(c) Throw away all correspondences which give extreme values of (depth of point - p), where p = depth of tracked point.

(d) Repeat step (c) until the number of points has been reduced to some threshold (typically the original number of points).

(e) Calculate the coefficients of $A$ and $\gamma$ in equation (5.3) for the remaining points, and apply the least squares method to obtain $A$ and $\gamma$.
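Steps (c) to (e) can be sketched as follows. This is a hypothetical illustration rather than the program used in the experiments: each match is a dictionary with assumed keys `x_l`, `x_r`, `y`, `Z`, `u_l`, `u_r`, `v_l`; pruning is done by keeping the matches whose depth is closest to `p`, which is equivalent to repeatedly discarding the most extreme one; and the linear constraint is assumed to have the form $v_l - y\phi = -A(1 - p/Z) + \gamma (x_l + x_r)/2$ with $\phi = (u_r - u_l)/(x_r - x_l)$.

```python
def prune_by_depth(matches, p, keep):
    """Steps (c)-(d): keep the `keep` matches whose depth is closest to p."""
    ranked = sorted(matches, key=lambda m: abs(m["Z"] - p))
    return ranked[:keep]


def fit_A_gamma(matches, p):
    """Step (e): least squares estimate of (A, gamma) via the normal equations."""
    s_aa = s_ag = s_gg = s_ar = s_gr = 0.0
    for m in matches:
        phi = (m["u_r"] - m["u_l"]) / (m["x_r"] - m["x_l"])
        c_a = -(1.0 - p / m["Z"])            # coefficient of A
        c_g = (m["x_l"] + m["x_r"]) / 2.0    # coefficient of gamma
        rhs = m["v_l"] - m["y"] * phi        # measured side of the constraint
        s_aa += c_a * c_a
        s_ag += c_a * c_g
        s_gg += c_g * c_g
        s_ar += c_a * rhs
        s_gr += c_g * rhs
    det = s_aa * s_gg - s_ag * s_ag
    if det == 0:
        raise ValueError("degenerate configuration: coefficients are collinear")
    A = (s_ar * s_gg - s_ag * s_gr) / det
    gamma = (s_aa * s_gr - s_ag * s_ar) / det
    return A, gamma


# Usage with synthetic matches that satisfy the constraint exactly,
# plus one false match at an extreme depth.
p, A_true, g_true = 10.0, 0.02, 0.05
matches = []
for x_l, x_r, y, Z in [(0.10, 0.14, 0.2, 8.0), (-0.20, -0.15, -0.1, 12.0),
                       (0.30, 0.33, 0.05, 9.0), (0.0, 0.05, 0.3, 11.0)]:
    u_l, u_r = 0.01, 0.03
    phi = (u_r - u_l) / (x_r - x_l)
    v_l = y * phi - A_true * (1.0 - p / Z) + g_true * (x_l + x_r) / 2.0
    matches.append({"x_l": x_l, "x_r": x_r, "y": y, "Z": Z,
                    "u_l": u_l, "u_r": u_r, "v_l": v_l})
matches.append({"x_l": 0.5, "x_r": 0.5001, "y": 0.0, "Z": 1000.0,
                "u_l": 0.0, "u_r": 0.0, "v_l": 5.0})
kept = prune_by_depth(matches, p, keep=4)
A_est, g_est = fit_A_gamma(kept, p)
```

With the false match removed by the depth test, the least squares fit returns the generating values of $A$ and $\gamma$.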

The experiments were performed for values of f (focal length) ranging from 35 mm to 200 mm, d (stereo baseline/2) ranging from 4 cm to 20 cm, f (angle of rotation) varying between 2 degrees and 5 degrees, and additive noise of up to 20 percent. We found that the algorithm was quite stable within these limits, recovering $A$ and $\gamma$ to within 10 - 25 percent accuracy.

As the radius (distance between epipolar lines for correspondence) increased, the error increased. Further, if steps (c) and (d) of the algorithm were not carried out, the errors were found to be much bigger, especially as the radius became large. The results are summarized in table 5.1. (Note that the values of V calculated from equation 5.9 are tabulated, together with $\gamma$.)

The  parameters  relevant  to  table  5.1  are  (the  unit  of  length  is  one  pixel 
width) : 

Focal  length  1000 
stereo  baseline  d  1000 

Kotation  (  a,d.-r  )  (  O.OfiHH  .  0.0229  .  O.OfisH  ) 
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Translation  =  (  U  ,  V,  W)  =  (  ■  227  ,  611  ,  34  ) 

Percentage  of  noise  =  10 

The algorithm works with large amounts of additive noise because most of the noise points get removed in step (c) of the algorithm. Other points whose depth is calculated to be very different from p also get removed, leaving points for which the $A$ coefficients in equation (3.3) are quite similar, which gives better results with least squares. The error is due to two factors:

(a) Discretization: This becomes especially important when the optical flow or the depths are small.

(b) Wrong correspondences: These may be reduced by using more elaborate statistical smoothing techniques in tandem with the parameter evaluation stage (e.g. the overall scheme can be a few iterations of a noise filtering step, then parameter Hough transform followed by further pruning of the input data and so on).

Table 5.1 Measurements for tracking with stereo fusion (columns: radius, av. false match count, V, γ)


5.5.2.  Convergent  Stereo 

The  geometric  configuration  in  this  case  is  depicted  in  figure  5.7.  The 
simulation  experiments  were  performed  under  the  following  assumptions: 

(1) The precision of angular measurements is up to half a minute of arc (i.e. the truncation error ~ .0001 radians).

(2) The error in estimating the angular positions of the camera axes is random and follows a normal distribution with zero mean and standard deviation no more than five times the error due to truncation.

(3) The motion of the system is smooth. In particular the path of translation is piecewise linear in time, and the speed of translation changes very slowly (no acceleration). Furthermore, there is no precession or any other change in the rate or direction of rotation.

The motion stimuli were generated synthetically by applying exact rigid transformations (rotations and translations), using user specified parameters. The time progression was modeled by a sequence of small intervals (ticks). At every tick three additional rotations were generated to maintain tracking.

The output of the data generation program consists of the sequence of values of $\theta$ and $\theta'$, which are the angles made by the optical axes of the left and right camera respectively with the base line. Additionally, the rotation of the camera system about the baseline was also recorded ($\omega$ of figure 5.1).

Figure 5.8 Time evolution of angular position
(i) All computations were done with respect to the left frame. The recorded angular values were artificially corrupted by random noise following a zero mean normal distribution, with standard deviation around [...]. The actual values, the observed values and the smoothed values of $\theta$ are plotted. The scale for $\theta$ is in radians while that of 'time' is in ticks.

(ii) The depth values and their derivatives were computed from the smoothed curve $\theta(t)$, using the equations (5.15) and (5.16).

(iii) The $\omega_y$ component of the tracking rotation is simply the first derivative of $\theta(t)$ with its sign reversed. The other components, $\omega_x$ and $\omega_z$, were computed from $\omega(t)$ and $\theta(t)$ by $\omega_x = -\omega\sin\theta$ and $\omega_z = \omega\cos\theta$.

(iv) Finally the value of the body rotation is computed from equation (5.20). The remaining parameters of motion are obtained from equations (5.19), (5.17) and (5.18) by back substitution.
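Step (iii) can be sketched as below. The sketch assumes that the smoothed samples of $\theta$ and $\omega$ are available at equally spaced ticks and that a central difference is an acceptable estimate of the derivative; the original does not say which differentiation scheme was used, and the function name `tracking_rotation` is illustrative.

```python
import math


def tracking_rotation(theta, omega, dt):
    """Tracking rotation (w_x, w_y, w_z) in the left frame at interior ticks.

    theta : smoothed samples of the left camera angle, in radians
    omega : samples of the rotation rate about the baseline
    dt    : duration of one tick
    """
    result = []
    for k in range(1, len(theta) - 1):
        # Central difference estimate of d(theta)/dt (an assumed scheme).
        theta_dot = (theta[k + 1] - theta[k - 1]) / (2.0 * dt)
        w_x = -omega[k] * math.sin(theta[k])
        w_y = -theta_dot                      # first derivative, sign reversed
        w_z = omega[k] * math.cos(theta[k])
        result.append((w_x, w_y, w_z))
    return result


# Usage: a linearly increasing angle and a constant rotation about the baseline.
theta_samples = [1.0 + 0.01 * k for k in range(5)]
omega_samples = [0.2] * 5
rates = tracking_rotation(theta_samples, omega_samples, dt=1.0)
```

Accelerations, which the closed form equations also require, could be obtained by differencing the resulting sequence once more.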

A typical set of values for the parameters used in the experiments is:

Stereo baseline length: 1.0
Initial value of θ: 100°
Initial value of θ′: 77°
Initial depth of tracked point: 18.62
Vector (unnormalized) specifying the rotation axis: (1, 2, 3)
The angle of rotation per time step: 2°
The translation vector: (0.2, 0.3, 0.1)
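These values can be cross-checked by triangulation. The sketch below assumes that the depth of the tracked point is its distance from the left camera along the left optical axis, that the initial angles are 100 and 77 degrees, and that the law of sines applies to the triangle formed by the two cameras and the tracked point. Equation (5.15) is not legible in this copy, so this form is an assumption and the function name is illustrative.

```python
import math


def tracked_depth(baseline, theta_deg, theta_prime_deg):
    """Distance from the left camera to the fixation point.

    theta and theta_prime are the angles the left and right optical axes
    make with the baseline. The angle at the fixation point is
    180 - (theta + theta_prime) degrees, so by the law of sines
    depth = baseline * sin(theta_prime) / sin(theta + theta_prime).
    """
    theta = math.radians(theta_deg)
    theta_prime = math.radians(theta_prime_deg)
    return baseline * math.sin(theta_prime) / math.sin(theta + theta_prime)


depth = tracked_depth(1.0, 100.0, 77.0)
```

Under these assumptions the computed distance agrees with the listed initial depth of 18.62 to two decimal places.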

The results for this set are plotted in figures 5.8 and 5.9. In figure 5.9 the actual value for the x component of the rotation and the two computed solutions of (5.19) are shown. The errors in the computed values of the rotation parameters were less than one percent. After back substitution, the errors in the values obtained for the translational parameters were found to be around 5%.

5.6.  Summary  &  Conclusions 

In  this  chapter  a  mathematical  framework  for  active  navigation, 
employing  tracking,  has  been  developed.  The  results  reported  here  suggest 
that  there  is  a  better  alternative  to  the  “passive”  technique  for  visual 
navigation  [19,  52,  68].  This  new  approach  is  termed  Active  Navigation.  The 
qualifier “active” is used because the mobile system is required to track an environmental feature.

Figure 5.9 Time evolution of rotation about x axis

The  passive  method  has  been  unable  to  deliver  practical  algorithms  for 
motion  parameter  estimation  due  to  the  fact  that  the  constraint  equations 
that  arise  are  nonlinear  and  involve  a  fairly  large  number  of  unknowns. 
The computation in such cases is hampered by sensitivity to small errors and the need to have initial estimates of the solution in order to commence the search/iteration in the nonlinear parameter space [84].

The  idea  that  tracking  environmental  points  may  be  beneficial  to 
navigation  has  previously  been  put  forward  by  Cutting  [22].  His  analysis, 
however,  is  largely  qualitative.  A  general  analysis  of  the  tracking  geometry 
shows  that  the  difficulties  in  motion  parameter  computation  are  alleviated 
under monocular imaging and largely removed for the binocular case. The problem with the binocular situation is that both motion and stereo correspondence are needed. Simulation experiments were conducted to examine the feasibility of this approach. The results are acceptable when the stereo fusional radius is known. Therefore, in itself, the stereo/motion approach cannot be recommended in practical cases, due to accumulation of stereo and motion matching errors.

On  the  other  hand,  an  analysis  of  a  tracking  system  as  in  figure  5.1 
shows  that  if  the  position,  angular  velocity  and  acceleration  of  the  tracking 
motors  can  be  measured  over  a  period  of  time,  then  closed  form  solutions 
of the egomotion parameters are obtainable. In general, two solutions are obtained at every time instant. However, an extended period of observation can disambiguate between these, since the two solution trajectories intersect at the correct solution (see figure 5.9).

Experimental simulation of the time evolution of the solution space
was  conducted,  using  discretely  generated  motion  data  corrupted  by  noise 
that  was  as  much  as  20%  of  the  rotation  parameter  value.  The  computational 
scheme  proved  to  be  quite  robust  with  respect  to  these  random  noise 
fluctuations.  The  point  is  that  the  equations  are  stable  enough,  so  that 
perturbations  caused  by  noise  are  not  overwhelming.  Therefore,  the  correct 
solution  trajectory  can  be  recovered  by  temporal  smoothing  and 
interpolation. 

Thus  a  strong  case  can  be  made  for  adopting  this  method  for  visual 
navigation,  when  the  mobile  system  is  undergoing  steady  motion.  Even 
when  the  steady  motion  assumption  holds  only  approximately  (e.g.  when 
there is a steady translational acceleration), the stability of the equations allows us to obtain reasonable estimates of the motion parameters. This suggests a cooperative scheme for the motion perception task, in which the closed form solutions obtained from the tracking constraints are used as initial estimates in the monocular and binocular “flow” modules to refine the solution and compute the structure of the observed surface. Such a
scheme  is  outlined  in  the  concluding  chapter. 


Chapter  Six 


Conclusion 

6.1. Summary and Discussions

The  purpose  of  this  research  was  to  analyze  the  problem  of  Rigid  Body 
Motion  Perception.  The  paradigm  adopted  for  this  study  was  a  model  for 
motion  perception  where  the  main  task  was  viewed  as  the  computation  of 
sets  of  parameters  that  defined  a  hierarchy  of  abstraction  levels.  The 
parameters  at  any  level  can  be  thought  of  as  succinct  representations  for  the 
invariances  that  characterize  that  level.  The  computations  performed  at  the 
proposed  levels  of  the  hierarchy  span  Low  Level  to  Intermediate  Level  visual 
processing  tasks. 

The  concept  of  spatial  receptive  field  (SRF)  of  the  parameters  at  any 
level  of  the  hierarchy  was  introduced  to  capture  the  notion  of  the  degree  of 
abstraction  realized  by  the  parameters  at  that  level.  Thus  at  lower  levels  of 
the  hierarchy,  involving  the  computation  of  optical  flow  for  instance,  the 
SRF is small. In contrast, at the higher levels of abstraction, an example of parameter type could be rigid body translational and rotational velocities, which have larger SRF. In the latter case, the SRF is global in the sense that it spans the part of the visual field that receives input from an entire moving surface.

The  study  was  divided  into  three  main  stages.  The  first  stage  had  to  do 
with  the  measurement  of  image  motion.  It  was  demonstrated  that  clustering 
is  a  powerful  tool  in  determining  and  structuring  the  image  motion  field. 
An  orthogonal  image  decomposition  scheme  was  also  introduced  to 
determine  match  tokens  in  time  varying  intensity  images. 

The second part dealt with the analysis of the algebraic constraints between the image motion field and the rigid body motion parameters. Here, the solutions of two open problems were derived. These had to do with the upper bound on the number of interpretations of the optical flow field, which was proved to be three, and the conditions under which unique interpretation is possible. Regarding the latter question it was proved that the ambiguity can be resolved by making observations at more than two time instants.

Finally, the analysis of the previous part was utilized to design algorithms for estimating the motion parameters from image motion, using the Hough transform technique.

It was seen that nonlinearity and large dimensionality of the parameter space were two obstacles to the solution of the motion perception problem. In chapter five, alternative Active strategies were proposed to tackle these difficulties. The notion of tracking was introduced in analogy with the human smooth pursuit system. It was shown that under this active scheme the motion parameter estimation problem becomes simpler. In the case of egomotion, when the observer's motion is steady in space over the period of observation, it is possible to obtain closed form solutions for the egomotion parameters. In general, when the assumption of steady motion does not hold, the above parameter estimate degrades. However, the tracking solution is proposed as one of the modules in a cooperative stereo-motion system (see figure 6.1).

In this cooperative scheme, the motion and stereo correspondence is aided by the initial parameter estimate. Future research will determine the efficacy of this approach, although as far as motion and structure interpretation from optical flow is concerned it forms a vital link in the proposed hierarchical motion perception model (figure 1.1), since good initial estimates of the parameter values, as seen in chapter four, greatly simplify the task of rigid body motion parameter computation.

6.2.  Future  Work 

The research reported in this thesis has opened up several avenues for further study. The task of motion perception is seemingly complex. The model proposed here is based on the belief that highly parallel hierarchies of simple local interactions are in principle up to the task. This belief is bolstered by the results demonstrated here and by the increasingly better understanding of the biological visual mechanisms evolving from research in psychology and the neural sciences.

Figure 6.1 A cooperative model for motion perception

Areas  in  which  future  research  in  computational  studies  in  motion 
perception  may  be  directed  include: 

(i) Correspondence using labeled interest points. The orthogonal decomposition operator provides a natural way of labeling the selected interest points. These labels, when used explicitly, may reduce the average number of match possibilities for the image motion algorithm.

(ii) Cluster Analysis for multiple time frames. The cluster based approach to image motion analysis is extendible to multiple time frames. This needs to be analyzed and experimentally evaluated.

(iii) Limited Spatial indexing of the velocity space. The clustering method is oblivious to the spatial coherence of the match data, only dealing with match vector or velocity values. It is easy to extend the clustering scheme to include some measure of spatial coherence like the principal components of the two dimensional spatial scatter. The details of such a strategy need to be worked out and implemented.

(iv) Alternative Clustering strategies: A controlled probabilistic relaxation technique seems possible. Maximizing entropy or energy functionals are good candidates. This needs further study.

(v) Sampling and quantization of the parameter space: Uniform sampling and quantization need not be the only solution. The sampling properties of the motion parameter space need to be studied to determine whether, for instance, random location and sizes of the cells lead to economy and efficiency without sharp decline in performance.

(vi) Tracking target selection: Aggregating many coherently moving features may help. The exact mechanism for doing this needs to be examined.

Appendix A

Uniqueness of Rigid Motion Parameters


Consider  a  point  P  in  space  whose  coordinates  are  ( X,Y,Z )  with  respect 
to  a  fixed  inertial  frame  XYZ.  The  image  of  this  point  is  p  =(*,*)  whose 
coordinates  are  given  with  respect  to  a  xy  frame  located  on  the  image  plane. 
The  relation  between  the  world  point  P  and  the  image  point  p  is  given  by 

(...)  (0 

where  T  is  the  focal  length  of  the  imaging  system.  This  is  assumed  to  be 

unity  in  the  following  analysis. 

Now let a rigid surface move with a translational velocity $V_T = (U,V,W)$ and
a rotational velocity $\Omega = (\alpha,\beta,\gamma)$. Then, from kinematics, the three
dimensional velocity of any point P on the surface can be written as

$$\frac{dP}{dt} = V_T + \Omega \times P \qquad \text{(ii)}$$

where 't' is the time variable and '×' denotes vector product.

In the differential motion case the image motion, or optical flow, is denoted by
$(u,v) = (dx/dt, dy/dt)$. Differentiating equation (i) and substituting from
equation (ii) we have the following relations

$$u = \frac{U - xW}{Z} - \alpha xy + \beta(x^2 + 1) - \gamma y$$

$$v = \frac{V - yW}{Z} - \alpha(y^2 + 1) + \beta xy + \gamma x \qquad \text{(iii)}$$

Eliminating the unknown depth variable from the above we get

$$\frac{u + \alpha xy - \beta(x^2 + 1) + \gamma y}{v + \alpha(y^2 + 1) - \beta xy - \gamma x} = \frac{U - xW}{V - yW} \qquad \text{(iv)}$$

The above equation describes the constraint imposed by the measured value
of optical flow $(u,v)$, at an image point $(x,y)$, on the six motion parameters
$(U,V,W,\alpha,\beta,\gamma)$.
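The flow relations and the depth-free constraint can be checked numerically. The following Python sketch is illustrative only: the function names `optical_flow` and `constraint_residual` are hypothetical, unit focal length is assumed as in the text, and the constraint is used in cross-multiplied form to avoid division by a vanishing denominator.

```python
def optical_flow(x, y, Z, trans, rot):
    """Image velocity (u, v) at image point (x, y) with depth Z (unit focal length)."""
    U, V, W = trans
    alpha, beta, gamma = rot
    u = (U - x * W) / Z - alpha * x * y + beta * (x * x + 1) - gamma * y
    v = (V - y * W) / Z - alpha * (y * y + 1) + beta * x * y + gamma * x
    return u, v


def constraint_residual(x, y, u, v, trans, rot):
    """Cross-multiplied depth-free constraint; zero when the six parameters fit (u, v)."""
    U, V, W = trans
    alpha, beta, gamma = rot
    # Remove the rotational part of the flow; what remains is purely translational.
    du = u + alpha * x * y - beta * (x * x + 1) + gamma * y
    dv = v + alpha * (y * y + 1) - beta * x * y - gamma * x
    return du * (V - y * W) - dv * (U - x * W)
```

For flow generated by `optical_flow` at any depth Z, the residual returned by `constraint_residual` with the same motion parameters should vanish up to rounding error.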


Proposition I. Given the rotation parameters the translation parameters can be
uniquely determined from the optical flow field.

Proof: First we define the function $\rho(x,y)$ where,

$$\rho = \frac{u + \alpha xy - \beta(x^2 + 1) + \gamma y}{v + \alpha(y^2 + 1) - \beta xy - \gamma x}$$

Now we analyse the following cases:

Case 1: If $\rho$ = constant then from equation (iv) we have $W = 0$. In this case
we can only obtain the ratio $U/V$ from the optical flow field.

Case 2: If $\rho \neq$ constant then there are two image points where $\rho$ is different.
In this case we can solve the resultant set of two linear equations,
obtained from (iv), to get $x_0 = U/W$ and $y_0 = V/W$.
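Case 2 can be sketched in code. With $W \neq 0$, the constraint reads $\rho\,(y_0 - y) = x_0 - x$, which is linear in $(x_0, y_0)$, so two image points with different values of the ratio suffice. The Python below is a minimal sketch under that reading; the names `derotated_ratio` and `focus_of_expansion` are hypothetical, and each sample is assumed to be a tuple `(x, y, u, v)`.

```python
def derotated_ratio(x, y, u, v, rot):
    """Ratio of the rotation-compensated flow components at (x, y)."""
    alpha, beta, gamma = rot
    num = u + alpha * x * y - beta * (x * x + 1) + gamma * y
    den = v + alpha * (y * y + 1) - beta * x * y - gamma * x
    return num / den


def focus_of_expansion(sample1, sample2, rot):
    """Solve x0 - r*y0 = x - r*y at two image points for (x0, y0) = (U/W, V/W)."""
    x1, y1, u1, v1 = sample1
    x2, y2, u2, v2 = sample2
    r1 = derotated_ratio(x1, y1, u1, v1, rot)
    r2 = derotated_ratio(x2, y2, u2, v2, rot)
    if r1 == r2:
        raise ValueError("ratio is equal at both points; pick different points")
    c1 = x1 - r1 * y1
    c2 = x2 - r2 * y2
    y0 = (c1 - c2) / (r2 - r1)
    x0 = c1 + r1 * y0
    return x0, y0
```

With noisy flow one would presumably use many points and a least squares fit rather than exactly two; that extension is not shown here.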

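Proposition II. Given the translation parameters the rotation parameters can be
uniquely determined from optical flow.

Proof: Here the values of $x_0$ and $y_0$ are known. The expression for optical
flow is,

$$u = (x_0 - x)\phi - \alpha xy + \beta(x^2 + 1) - \gamma y$$

$$v = (y_0 - y)\phi - \alpha(y^2 + 1) + \beta xy + \gamma x$$

where $(\alpha,\beta,\gamma)$ are the rotation parameters and $\phi = W/Z$ is the reciprocal of the
scaled depth function. If possible let there be another surface moving with
the same translation but different rotation parameters, but generating the
same optical flow. Thus we have,

$$u = (x_0 - x)\phi' - \alpha' xy + \beta'(x^2 + 1) - \gamma' y$$

$$v = (y_0 - y)\phi' - \alpha'(y^2 + 1) + \beta' xy + \gamma' x$$

Now from the above sets of equations by subtracting appropriately we get,

$$0 = (x_0 - x)(\phi - \phi') - \Delta\alpha\, xy + \Delta\beta(x^2 + 1) - \Delta\gamma\, y \qquad \text{(v.a)}$$

$$0 = (y_0 - y)(\phi - \phi') - \Delta\alpha(y^2 + 1) + \Delta\beta\, xy + \Delta\gamma\, x \qquad \text{(v.b)}$$

where $\Delta\alpha = \alpha - \alpha'$, $\Delta\beta = \beta - \beta'$ and $\Delta\gamma = \gamma - \gamma'$. Eliminating $(\phi - \phi')$ from the
above we have,

$$(\Delta\alpha\, x_0 + \Delta\beta\, y_0) - x(\Delta\gamma\, x_0 + \Delta\alpha) - y(\Delta\gamma\, y_0 + \Delta\beta) + x^2(\Delta\beta\, y_0 + \Delta\gamma) + y^2(\Delta\alpha\, x_0 + \Delta\gamma) - xy(\Delta\beta\, x_0 + \Delta\alpha\, y_0) = 0 \qquad \text{(vi)}$$

Since the above equation is valid everywhere in the image,

$$\Delta\alpha\, x_0 + \Delta\beta\, y_0 = 0 \qquad \Delta\alpha\, y_0 + \Delta\beta\, x_0 = 0$$

$$\Delta\gamma\, x_0 + \Delta\alpha = 0 \qquad \Delta\beta\, y_0 + \Delta\gamma = 0$$

$$\Delta\gamma\, y_0 + \Delta\beta = 0 \qquad \Delta\alpha\, x_0 + \Delta\gamma = 0$$

From the above we obtain,

$$\Delta\alpha = \Delta\beta = \Delta\gamma = 0$$

This means that $\alpha = \alpha'$, $\beta = \beta'$ and $\gamma = \gamma'$ and therefore, the rotation
parameters are uniquely determined when the translation parameters are
known.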
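Proposition II establishes uniqueness but does not prescribe a procedure. One possible computation, offered here only as a sketch, eliminates the scaled inverse depth from the two flow equations at each image point; what remains is a single equation that is linear in the three rotation parameters, so three points in general position give a 3 by 3 system. The function names and the use of Cramer's rule are assumptions made for this illustration.

```python
def _det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def rotation_from_foe(samples, x0, y0):
    """Recover (alpha, beta, gamma) from three samples (x, y, u, v), given (x0, y0)."""
    rows, rhs = [], []
    for x, y, u, v in samples:
        dx, dy = x0 - x, y0 - y
        # Coefficients of alpha, beta, gamma after eliminating the inverse depth.
        rows.append([dy * x * y - dx * (y * y + 1),
                     dx * x * y - dy * (x * x + 1),
                     dx * x + dy * y])
        rhs.append(dx * v - dy * u)
    det = _det3(rows)
    if abs(det) < 1e-12:
        raise ValueError("degenerate point configuration")
    solution = []
    for col in range(3):
        replaced = [row[:] for row in rows]
        for r in range(3):
            replaced[r][col] = rhs[r]  # Cramer's rule: swap in the right-hand side
        solution.append(_det3(replaced) / det)
    return tuple(solution)
```

The three sample points must not make the system singular; a practical version would likely use more points and a least squares solution.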

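Proposition III. If the structure of a rigidly moving surface is known, then the
parameters describing its motion are uniquely determined.

Proof: Knowing structure means that we have the depth values available up
to some scale factor. Thus in equation (iii) the value 'Z' is no longer an
unknown. The unknown scale factor is lumped with the translation
parameters. Now proceeding in a manner analogous to the previous proof,
with $\Delta U$, $\Delta V$, $\Delta W$, $\Delta\alpha$, $\Delta\beta$, $\Delta\gamma$ denoting the differences between the
parameters of two motions that generate the same optical flow, we have,

$$\frac{1}{Z}(\Delta U - x\Delta W) = \Delta\alpha\, xy - \Delta\beta(x^2 + 1) + \Delta\gamma\, y \qquad \text{(vii.a)}$$

$$\frac{1}{Z}(\Delta V - y\Delta W) = \Delta\alpha(y^2 + 1) - \Delta\beta\, xy - \Delta\gamma\, x \qquad \text{(vii.b)}$$

Eliminating $1/Z$ we have,

$$(\Delta\alpha\Delta U + \Delta\beta\Delta V) - x(\Delta\gamma\Delta U + \Delta\alpha\Delta W) - y(\Delta\beta\Delta W + \Delta\gamma\Delta V) + x^2(\Delta\gamma\Delta W + \Delta\beta\Delta V) + y^2(\Delta\gamma\Delta W + \Delta\alpha\Delta U) - xy(\Delta\alpha\Delta V + \Delta\beta\Delta U) = 0$$

Since the above equation must be valid all over the image plane, the
following relations hold:

$$\Delta\alpha\Delta U + \Delta\beta\Delta V = 0 \qquad \Delta\alpha\Delta W + \Delta\gamma\Delta U = 0 \qquad \Delta\beta\Delta W + \Delta\gamma\Delta V = 0$$

$$\Delta\alpha\Delta V + \Delta\beta\Delta U = 0 \qquad \Delta\beta\Delta V + \Delta\gamma\Delta W = 0 \qquad \Delta\alpha\Delta U + \Delta\gamma\Delta W = 0$$

From eqn. (vii) and the above relations we have,

$$\Delta U = \Delta V = \Delta W = \Delta\alpha = \Delta\beta = \Delta\gamma = 0$$

Therefore, once the structure is known for a rigidly moving surface, its
translation (up to a scale factor) and its rotation are determined uniquely
from the optical flow generated by the motion.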
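When the depth is known, the flow equations are linear in all six motion parameters, which suggests a direct computation. The sketch below is an illustration under stated assumptions, not a procedure taken from the text: it uses exactly three image points in general position (two equations each), unit focal length, and a small Gaussian elimination routine; the names `solve_linear` and `motion_from_known_depth` are hypothetical.

```python
def solve_linear(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def motion_from_known_depth(samples):
    """Recover [U, V, W, alpha, beta, gamma] from three samples (x, y, Z, u, v)."""
    if len(samples) != 3:
        raise ValueError("this sketch expects exactly three samples")
    A, b = [], []
    for x, y, Z, u, v in samples:
        # One row per flow component, with unknowns ordered U, V, W, alpha, beta, gamma.
        A.append([1.0 / Z, 0.0, -x / Z, -x * y, x * x + 1, -y])
        b.append(u)
        A.append([0.0, 1.0 / Z, -y / Z, -(y * y + 1), x * y, x])
        b.append(v)
    return solve_linear(A, b)
```

If the depths are known only up to a scale factor, the recovered translation carries the same factor, in line with the proposition.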


Appendix  B 


Representation  of  surface  shape 


In computer vision, the terms surface orientation map and shape are
sometimes used interchangeably. The following is an attempt to explain the
basis of this usage. The cases of perspective as well as orthographic
projection are considered. Shape information obtainable from a surface
orientation map in image coordinates is also explored.

Representations  for  surface  orientation 

A direction in three space is specified by two independent parameters.

A. (Latitude, Longitude): The coordinates are denoted by $(\theta,\phi)$ where
$0 \le \theta \le \pi$, $0 \le \phi \le 2\pi$.

B. Coordinates on the Gaussian (or unit radius) sphere. If the coordinates
are $(l,m,n)$ then $l^2 + m^2 + n^2 = 1$.

C. (Slant, Tilt): Slant is the tangent of the latitude angle (or $\tan\theta$) while
tilt is the longitude angle. The symbolic notation is $(\sigma,\tau)$.

D. (Gradient): If the depth is expressed in the form $Z = f(X,Y)$, then it is
the level surface $F(X,Y,Z) = 0$, where

$$F(X,Y,Z) = f(X,Y) - Z$$

The gradient of $F(\cdot)$, i.e. $(\partial f/\partial X, \partial f/\partial Y, -1)$, gives the orientation of the
surface (in the direction of increasing $F(\cdot)$). The gradient notation is
written as $(p,q)$, where $(p,q) = (\partial f/\partial X, \partial f/\partial Y)$.

Relationship among the surface normal representations:

$$\sqrt{p^2 + q^2} = \tan\theta = \sigma$$

$$\frac{q}{p} = \tan\phi = \tan\tau$$

$$(l,m,n) = \left( \frac{p}{g}, \frac{q}{g}, -\frac{1}{g} \right), \qquad g = \sqrt{p^2 + q^2 + 1}$$
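The relationships among the representations translate directly into conversion routines. The following Python sketch is illustrative; the function names are hypothetical, and the use of `atan2` to pick the quadrant of the tilt angle is an assumption, since the relations above fix only $\tan\tau = q/p$.

```python
import math


def gradient_to_slant_tilt(p, q):
    """Return (sigma, tau): slant sigma = sqrt(p^2 + q^2), tilt tau with tan(tau) = q/p."""
    sigma = math.hypot(p, q)
    tau = math.atan2(q, p)  # quadrant choice is an assumption of this sketch
    return sigma, tau


def slant_tilt_to_gradient(sigma, tau):
    """Inverse of gradient_to_slant_tilt."""
    return sigma * math.cos(tau), sigma * math.sin(tau)


def gradient_to_unit_normal(p, q):
    """Return (l, m, n) = (p/g, q/g, -1/g) on the Gaussian sphere."""
    g = math.sqrt(p * p + q * q + 1.0)
    return p / g, q / g, -1.0 / g
```

For example, the gradient $(p,q) = (3,4)$ has slant 5, and its unit normal satisfies $l^2 + m^2 + n^2 = 1$ up to rounding.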

Shape  under  Perspective  Projection 

In the case of perspective projection the relation between a world point
$(X,Y,Z)$ and its projection $(x,y)$ in the image plane is given by

$$(x,y) = \left( \frac{FX}{Z}, \frac{FY}{Z} \right) \qquad \text{(i)}$$

where F is the focal length of the imaging system.

The surface is represented in the world frame by the functional form
$Z(X,Y)$. It is assumed that the surface can also be represented (at least
locally) by the function $z(x,y)$ in image coordinates. Here the relation
between the surface normals $(\partial Z/\partial X, \partial Z/\partial Y)$ corresponding to an image point $(x,y)$
and the partial derivatives of $z(x,y)$ is sought.

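A. Relationship between surface gradients in image and world coordinates. Now
a small displacement $(\delta x,\delta y)$ in the image plane corresponds to a
displacement $(\delta X,\delta Y,\delta Z)$ in the world frame, along the surface $Z(X,Y)$.
From equation (i) we get the relation

$$\delta X = \frac{\delta x\, Z + x\, \delta Z}{F} \qquad \text{(ii.a)}$$

$$\delta Y = \frac{\delta y\, Z + y\, \delta Z}{F} \qquad \text{(ii.b)}$$

Furthermore the following identity holds

$$Z(X + \delta X, Y + \delta Y) = z(x + \delta x, y + \delta y) \qquad \text{(iii)}$$

Using the Taylor series expansion of the above

$$Z(X + \delta X, Y + \delta Y) = Z(X,Y) + \delta X \frac{\partial Z}{\partial X} + \delta Y \frac{\partial Z}{\partial Y} + \text{(higher order terms)} \qquad \text{(iv.a)}$$

$$z(x + \delta x, y + \delta y) = z(x,y) + \delta x \frac{\partial z}{\partial x} + \delta y \frac{\partial z}{\partial y} + \text{(higher order terms)} \qquad \text{(iv.b)}$$

Neglecting the higher order terms in equation (iv) and substituting for $\delta X$
and $\delta Y$ from equation (ii) in equation (iv.a)

$$Z(X + \delta X, Y + \delta Y) - Z(X,Y) = \delta Z = \frac{1}{F}(\delta x\, Z + x\, \delta Z) \frac{\partial Z}{\partial X} + \frac{1}{F}(\delta y\, Z + y\, \delta Z) \frac{\partial Z}{\partial Y}$$

or

$$\delta Z \left( F - x \frac{\partial Z}{\partial X} - y \frac{\partial Z}{\partial Y} \right) = Z\, \delta x \frac{\partial Z}{\partial X} + Z\, \delta y \frac{\partial Z}{\partial Y} \qquad \text{(v)}$$

Recall now that

$$Z(X + \delta X, Y + \delta Y) - Z(X,Y) = z(x + \delta x, y + \delta y) - z(x,y)$$

Therefore combining equations (iii), (iv) and (v)

$$\delta x \, \frac{Z \frac{\partial Z}{\partial X}}{F - x \frac{\partial Z}{\partial X} - y \frac{\partial Z}{\partial Y}} + \delta y \, \frac{Z \frac{\partial Z}{\partial Y}}{F - x \frac{\partial Z}{\partial X} - y \frac{\partial Z}{\partial Y}} = \delta x \frac{\partial z}{\partial x} + \delta y \frac{\partial z}{\partial y} \qquad \text{(vi)}$$

Since $\delta x$ and $\delta y$ are independent of each other we have

$$\frac{\partial z}{\partial x} = \frac{Z \frac{\partial Z}{\partial X}}{F - x \frac{\partial Z}{\partial X} - y \frac{\partial Z}{\partial Y}} \qquad \text{(vii.a)}$$

$$\frac{\partial z}{\partial y} = \frac{Z \frac{\partial Z}{\partial Y}}{F - x \frac{\partial Z}{\partial X} - y \frac{\partial Z}{\partial Y}} \qquad \text{(vii.b)}$$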
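Relations (vii) can be checked on a surface whose image-centered depth function has a closed form. The sketch below assumes a plane $Z = aX + bY + c$, for which substituting $X = xZ/F$ and $Y = yZ/F$ gives $z(x,y) = cF/(F - ax - by)$; it compares the gradient predicted by (vii) with a central finite difference. The plane, the function names and the step size are all choices made for this illustration.

```python
def plane_image_depth(x, y, a, b, c, F=1.0):
    """Image-centered depth z(x, y) of the plane Z = a*X + b*Y + c."""
    return c * F / (F - a * x - b * y)


def predicted_image_gradient(x, y, a, b, c, F=1.0):
    """(dz/dx, dz/dy) from relations (vii), using dZ/dX = a and dZ/dY = b."""
    Z = plane_image_depth(x, y, a, b, c, F)
    denom = F - x * a - y * b
    return Z * a / denom, Z * b / denom


def numeric_image_gradient(x, y, a, b, c, F=1.0, h=1e-6):
    """(dz/dx, dz/dy) by central finite differences of z(x, y)."""
    dz_dx = (plane_image_depth(x + h, y, a, b, c, F)
             - plane_image_depth(x - h, y, a, b, c, F)) / (2 * h)
    dz_dy = (plane_image_depth(x, y + h, a, b, c, F)
             - plane_image_depth(x, y - h, a, b, c, F)) / (2 * h)
    return dz_dx, dz_dy
```

The two gradients should agree to within the finite difference error wherever the denominator $F - ax - by$ is not close to zero.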


B. What Shape means. Consider the shape information available from the
field of surface normals indexed by the image coordinates. Making the
appropriate substitutions from equations (vii) in equation (iv.b) we have:

$$z(x + \delta x, y + \delta y) = z(x,y) \left[ 1 + \delta x \, \frac{\frac{\partial Z}{\partial X}}{F - x \frac{\partial Z}{\partial X} - y \frac{\partial Z}{\partial Y}} + \delta y \, \frac{\frac{\partial Z}{\partial Y}}{F - x \frac{\partial Z}{\partial X} - y \frac{\partial Z}{\partial Y}} \right]$$

Thus the following statement can be made:

Under perspective projection, when the field of surface normals is available,
indexed by image coordinates, then the image centered depth function can be
computed up to a dilation factor.
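The statement can be illustrated by integrating the relative depth change along a path in the image. Dividing the expansion above by $z(x,y)$ shows that the increment of $\ln z$ depends only on the surface normals, so the ratio of depths at two image points is recoverable while the overall scale is not. The Python below is a sketch under assumptions: `normal_field` is a hypothetical callable returning $(\partial Z/\partial X, \partial Z/\partial Y)$ at an image point, the path is a straight line, and a midpoint rule is used for the integration.

```python
import math


def depth_ratio(normal_field, start, end, F=1.0, steps=1000):
    """Return z(end) / z(start) by integrating d(ln z) along a straight image path."""
    (xa, ya), (xb, yb) = start, end
    dx = (xb - xa) / steps
    dy = (yb - ya) / steps
    log_ratio = 0.0
    for i in range(steps):
        # Midpoint of the i-th path segment.
        x = xa + (i + 0.5) * dx
        y = ya + (i + 0.5) * dy
        p, q = normal_field(x, y)
        log_ratio += (p * dx + q * dy) / (F - x * p - y * q)
    return math.exp(log_ratio)
```

For a plane with constant normals $(a, b)$ the result should match $(F - a x_a - b y_a)/(F - a x_b - b y_b)$, independent of the plane's offset, which is the dilation ambiguity in miniature.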


Lemma I. If the surface Z is represented by an algebraic function $Z(X,Y)$ and
furthermore if the function $z(x,y)$ denotes the same surface in terms of the
image coordinates $(x,y)$, then the tilt function $\tau(x,y)$ is given by

$$\tan\tau = \frac{\partial Z / \partial Y}{\partial Z / \partial X} = \frac{\partial z / \partial y}{\partial z / \partial x}$$

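Proof: Since $Z(X,Y)$ is an algebraic function, by definition it can be
expressed implicitly by the polynomial equation $F(X,Y,Z) = 0$. We can write
$F(\cdot)$ as

$$\sum_{i=0}^{L} \sum_{j=0}^{M} \sum_{k=0}^{N} c_{ijk} X^i Y^j Z^k = 0 \qquad \text{(viii)}$$

where the $c_{ijk}$'s are real constants and L, M, N are finite positive integers.
By using the implicit function theorem we get

$$\tan\tau = \frac{\partial Z / \partial Y}{\partial Z / \partial X} = \frac{-F_Y / F_Z}{-F_X / F_Z} = \frac{F_Y}{F_X}$$

where $F_X, F_Y, F_Z$ denote the partial derivatives of $F(\cdot)$ with respect to X, Y and
Z. Therefore we have from equation (viii):

$$\tan\tau = \frac{\sum_{i=0}^{L} \sum_{j=1}^{M} \sum_{k=0}^{N} j\, c_{ijk} X^{i} Y^{j-1} Z^{k}}{\sum_{i=1}^{L} \sum_{j=0}^{M} \sum_{k=0}^{N} i\, c_{ijk} X^{i-1} Y^{j} Z^{k}} \qquad \text{(ix)}$$

Observe now that we can obtain an implicit representation for the depth in
terms of the image coordinates $(x,y)$ from equation (viii) by substituting for
X and Y in accordance with $x = X/Z$ and $y = Y/Z$ (where the focal length is
assumed to be 1). Thus we obtain the representation $G(x,y,z) = 0$ or

$$\sum_{i=0}^{L} \sum_{j=0}^{M} \sum_{k=0}^{N} c_{ijk} x^i y^j z^{i+j+k} = 0 \qquad \text{(x)}$$

Again by the implicit function theorem we have

$$\frac{\partial z / \partial y}{\partial z / \partial x} = \frac{\sum_{i=0}^{L} \sum_{j=1}^{M} \sum_{k=0}^{N} j\, c_{ijk} x^{i} y^{j-1} z^{i+j+k}}{\sum_{i=1}^{L} \sum_{j=0}^{M} \sum_{k=0}^{N} i\, c_{ijk} x^{i-1} y^{j} z^{i+j+k}}$$

or

$$\frac{\partial z / \partial y}{\partial z / \partial x} = \frac{\sum_{i=0}^{L} \sum_{j=1}^{M} \sum_{k=0}^{N} j\, c_{ijk} x^{i} y^{j-1} z^{i+j+k-1}}{\sum_{i=1}^{L} \sum_{j=0}^{M} \sum_{k=0}^{N} i\, c_{ijk} x^{i-1} y^{j} z^{i+j+k-1}} \qquad \text{(xi)}$$

Consider now equation (ix) and substitute $X = xz$ and $Y = yz$:

$$\frac{\partial Z / \partial Y}{\partial Z / \partial X} = \frac{\sum_{i=0}^{L} \sum_{j=1}^{M} \sum_{k=0}^{N} j\, c_{ijk} x^{i} y^{j-1} z^{i+j+k-1}}{\sum_{i=1}^{L} \sum_{j=0}^{M} \sum_{k=0}^{N} i\, c_{ijk} x^{i-1} y^{j} z^{i+j+k-1}} \qquad \text{(xii)}$$

But the right hand sides of the equations (xi) and (xii) are identical. This
means,

$$\tan\tau = \frac{\partial Z / \partial Y}{\partial Z / \partial X} = \frac{\partial z / \partial y}{\partial z / \partial x}$$

which concludes the proof of the lemma.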
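The lemma can be checked numerically on a simple algebraic surface. The sketch below assumes the quadric $Z = c + aX^2 + bY^2$ and unit focal length; substituting $X = xz$ and $Y = yz$ gives $z = c + (ax^2 + by^2) z^2$, whose root continuous with $z = c$ at the image origin is used. The surface, the function names and the finite difference step are assumptions made for the illustration.

```python
import math


def image_depth(x, y, a, b, c):
    """Depth z(x, y) of Z = c + a*X**2 + b*Y**2 under x = X/Z, y = Y/Z."""
    k = a * x * x + b * y * y
    # Root of k*z**2 - z + c = 0 that tends to c as k tends to 0.
    return 2.0 * c / (1.0 + math.sqrt(1.0 - 4.0 * k * c))


def world_gradient_ratio(x, y, a, b, c):
    """(dZ/dY) / (dZ/dX) at the world point imaged at (x, y)."""
    z = image_depth(x, y, a, b, c)
    X, Y = x * z, y * z
    return (2.0 * b * Y) / (2.0 * a * X)


def image_gradient_ratio(x, y, a, b, c, h=1e-6):
    """(dz/dy) / (dz/dx) by central finite differences of z(x, y)."""
    dz_dx = (image_depth(x + h, y, a, b, c) - image_depth(x - h, y, a, b, c)) / (2 * h)
    dz_dy = (image_depth(x, y + h, a, b, c) - image_depth(x, y - h, a, b, c)) / (2 * h)
    return dz_dy / dz_dx
```

Away from $x = 0$ and within the region where the square root is real, the two ratios should agree to within the finite difference error.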


Shape  under  Orthographic  Projection: 

Under orthography the image coordinates of a point are equal to the
corresponding three dimensional coordinates, or

$$(x,y) = (X,Y)$$

Thus

$$\left( \frac{\partial z}{\partial x}, \frac{\partial z}{\partial y} \right) = \left( \frac{\partial Z}{\partial X}, \frac{\partial Z}{\partial Y} \right)$$

Now observe from equation (iv.a) that when the surface normals are known
at an image point $(x,y)$, then the depth differences between this point and
neighbouring image points are known:

$$Z(X + \delta X, Y + \delta Y) - Z(X,Y) = \delta X \frac{\partial Z}{\partial X} + \delta Y \frac{\partial Z}{\partial Y} + \text{(higher order terms)}$$

Thus we can state the following:

When a map of surface normals is available under orthography, the depth
function can be computed up to a constant additive term.
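A minimal sketch of this computation follows, assuming the gradient is sampled on a regular grid with spacing `h`; the grid layout, the trapezoidal rule and the function name `integrate_gradient_grid` are assumptions of the illustration. Depth is accumulated along the first row and then down each column, so the result is the depth relative to the grid origin, which is the additive constant left undetermined.

```python
def integrate_gradient_grid(p, q, h):
    """Depth relative to the grid origin from sampled gradients.

    p[i][j] and q[i][j] hold dZ/dX and dZ/dY at X = j*h, Y = i*h.
    """
    rows, cols = len(p), len(p[0])
    Z = [[0.0] * cols for _ in range(rows)]
    # Integrate dZ/dX along the first row (trapezoidal rule).
    for j in range(1, cols):
        Z[0][j] = Z[0][j - 1] + 0.5 * (p[0][j - 1] + p[0][j]) * h
    # Integrate dZ/dY down every column, starting from the first row.
    for i in range(1, rows):
        for j in range(cols):
            Z[i][j] = Z[i - 1][j] + 0.5 * (q[i - 1][j] + q[i][j]) * h
    return Z
```

With noisy normals the result depends on the integration path, so a practical method would presumably average over paths or solve a least squares problem; that is beyond this sketch.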


